Class RSSConnector
- java.lang.Object
-
- org.apache.manifoldcf.core.connector.BaseConnector
-
- org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
-
- org.apache.manifoldcf.crawler.connectors.rss.RSSConnector
-
- All Implemented Interfaces:
org.apache.manifoldcf.core.interfaces.IConnector, org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
public class RSSConnector extends org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
This is the RSS implementation of the IRepositoryConnector interface. The connector reads an RSS document in order to seed the document queue. That feed document is always fetched from the same URL, which is specified in the configuration parameters. The documents subsequently crawled are not scraped for additional links; only the primary document is ingested. Redirections, however, ARE honored, so that sites that rely on them (e.g. the BBC) can be supported.
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class: Description
protected static class RSSConnector.CanonicalizationPolicies: Class representing a list of canonicalization rules
protected static class RSSConnector.CanonicalizationPolicy: Class representing a URL regular expression match, for the purposes of determining canonicalization policy
protected static class RSSConnector.EvaluatorToken: Evaluator token.
protected static class RSSConnector.EvaluatorTokenStream: Token stream.
protected class RSSConnector.FeedAuthorContextClass
protected class RSSConnector.FeedContextClass
protected class RSSConnector.FeedItemContextClass
protected static class RSSConnector.Filter: Class that handles parsing and interpretation of the document specification.
protected static class RSSConnector.MappingRule: Class representing a mapping rule
protected static class RSSConnector.MappingRules: Class that represents all mappings
protected static class RSSConnector.NameValue: Name/value class
protected class RSSConnector.OuterContextClass: This class handles the outermost XML context for the feed document.
protected class RSSConnector.RDFContextClass
protected class RSSConnector.RDFItemContextClass
protected class RSSConnector.RSSChannelContextClass
protected class RSSConnector.RSSContextClass
protected class RSSConnector.RSSItemContextClass
protected static class RSSConnector.ThrottleSpec: The throttle specification class.
protected class RSSConnector.UrlsetContextClass
protected class RSSConnector.UrlsetItemContextClass
-
Field Summary
Fields
Modifier and Type, Field: Description
static java.lang.String _rcsid
static java.lang.String ACTIVITY_FETCH
static java.lang.String ACTIVITY_PROCESS
static java.lang.String ACTIVITY_ROBOTSPARSE
protected static DataCache cache
static int CHROMED_METADATA_ONLY: Chromed suppression mode - index metadata only if dechromed content not available
static int CHROMED_SKIP: Chromed suppression mode - skip documents if dechromed content not available
static int CHROMED_USE: Chromed suppression mode - use chromed content if dechromed content not available
static int DECHROMED_CONTENT: Dechromed content mode - content field
static int DECHROMED_DESCRIPTION: Dechromed content mode - description field
static int DECHROMED_NONE: Dechromed content mode - none
protected ThrottledFetcher fetcher: The throttled fetcher used by this instance
protected static java.util.Map<java.lang.String,ThrottledFetcher> fetcherMap: Storage for fetcher objects
protected java.lang.String from: The email address for this connector instance
protected boolean isInitialized: Flag indicating whether session data is initialized
protected int maxOpenConnectionsPerServer: The maximum open connections
protected double minimumMillisecondsPerBytePerServer: The minimum milliseconds between bytes
protected long minimumMillisecondsPerFetchPerServer: The minimum milliseconds between fetches
protected java.lang.String proxyAuthDomain: Proxy auth domain
protected java.lang.String proxyAuthPassword: Proxy auth password
protected java.lang.String proxyAuthUsername: Proxy auth username
protected java.lang.String proxyHost: The proxy host
protected int proxyPort: The proxy port
protected Robots robots: The robots object used by this instance
protected static int ROBOTS_ALL
protected static int ROBOTS_DATA
protected static int ROBOTS_NONE
protected static java.util.Map robotsMap: Storage for robots objects
protected int robotsUsage: Robots usage flag
protected static java.lang.String rssThrottleGroupType
protected java.lang.String throttleGroupName: The throttle group name
protected static java.util.Map understoodProtocols
protected java.lang.String userAgent: The user-agent for this connector instance
protected static java.util.Set<java.lang.String> xmlContentTypes
-
Fields inherited from class org.apache.manifoldcf.core.connector.BaseConnector
currentContext, params
-
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
GLOBAL_DENY_TOKEN, JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_CHAINED_ADD, MODEL_CHAINED_ADD_CHANGE, MODEL_CHAINED_ADD_CHANGE_DELETE, MODEL_PARTIAL
-
-
Constructor Summary
Constructors
Constructor: Description
RSSConnector(): Constructor.
-
Method Summary
Methods
Modifier and Type, Method: Description
java.lang.String addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities, org.apache.manifoldcf.core.interfaces.Specification spec, java.lang.String lastSeedVersion, long seedTime, int jobMode): Queue "seed" documents.
java.lang.String check(): Check status of connection.
protected static void compileList(java.util.List<java.util.regex.Pattern> output, java.util.List<java.lang.String> input): Compile all regexp entries in the passed in list, and add them to the output list.
void connect(org.apache.manifoldcf.core.interfaces.ConfigParams configParams): Connect.
void disconnect(): Close the connection.
protected static java.lang.String doCanonicalization(RSSConnector.CanonicalizationPolicy p, WebURL url): Code to canonicalize a URL.
java.lang.String[] getActivitiesList(): Return the list of activities that this connector supports (i.e. writes into the log).
java.lang.String[] getBinNames(java.lang.String documentIdentifier): Get the bin name string for a document identifier.
int getConnectorModel(): Tell the world what model this connector uses for getDocumentIdentifiers().
protected ThrottledFetcher getFetcher(): Given the current parameters, find the correct throttled fetcher object (or create one if not there).
int getMaxDocumentRequest(): Get the maximum number of documents to amalgamate together into one batch, for this connector.
protected Robots getRobots(ThrottledFetcher fetcher): Given the current parameters, find the correct robots object (or create one if none found).
protected void getSession(): Establish a session
protected static void handleIOException(java.io.IOException e, java.lang.String context)
protected void handleRSSFeedSAX(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, RSSConnector.Filter filter): Handle an RSS feed document, using SAX to limit the memory impact
protected static java.lang.String makeDocumentIdentifier(RSSConnector.CanonicalizationPolicies policies, java.lang.String parentIdentifier, java.lang.String rawURL): Convert an absolute or relative URL to a document identifier.
void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName): Output the configuration body section.
void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.List<java.lang.String> tabsArray): Output the configuration header section.
void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, int actualSequenceNumber, java.lang.String tabName): Output the specification body section.
void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, java.util.List<java.lang.String> tabsArray): Output the specification header section.
void poll(): This method is periodically called for all connectors that are connected but not in active use.
java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters): Process a configuration post.
void processDocuments(java.lang.String[] documentIdentifiers, org.apache.manifoldcf.crawler.interfaces.IExistingVersions statuses, org.apache.manifoldcf.core.interfaces.Specification spec, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int jobMode, boolean usesDefaultAuthority): Process a set of documents.
java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber): Process a specification post.
protected static java.util.List<java.lang.String> stringToArray(java.lang.String input): Read a string as a sequence of individual expressions, urls, etc.
void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters): View configuration.
void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber): View specification.
-
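The compileList entry in the method summary describes compiling a list of regular expression strings into an output list. A minimal sketch of that idea follows; the class name CompileListSketch is hypothetical, and the real method may apply pattern flags or wrap errors differently than shown here.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not the ManifoldCF implementation.
final class CompileListSketch {
    // Compiles every expression in 'input' and appends the result to 'output'.
    // An invalid expression raises java.util.regex.PatternSyntaxException,
    // which this sketch simply lets propagate to the caller.
    static void compileList(List<Pattern> output, List<String> input) {
        for (String expression : input) {
            output.add(Pattern.compile(expression));
        }
    }
}
```

A caller would typically build the input list from configuration text, call compileList once, and then reuse the compiled patterns for every URL it needs to test.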
Methods inherited from class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
getFormCheckJavascriptMethodName, getFormPresaveCheckJavascriptMethodName, getRelationshipTypes, requestInfo
-
Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector
clearThreadContext, deinstall, getConfiguration, install, isConnected, outputConfigurationBody, outputConfigurationHeader, outputConfigurationHeader, pack, packFixedList, packList, packList, processConfigurationPost, setThreadContext, unpack, unpackFixedList, unpackList, viewConfiguration
-
Field Detail
-
_rcsid
public static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
rssThrottleGroupType
protected static final java.lang.String rssThrottleGroupType
- See Also:
- Constant Field Values
-
ROBOTS_NONE
protected static final int ROBOTS_NONE
- See Also:
- Constant Field Values
-
ROBOTS_DATA
protected static final int ROBOTS_DATA
- See Also:
- Constant Field Values
-
ROBOTS_ALL
protected static final int ROBOTS_ALL
- See Also:
- Constant Field Values
-
DECHROMED_NONE
public static final int DECHROMED_NONE
Dechromed content mode - none
- See Also:
- Constant Field Values
-
DECHROMED_DESCRIPTION
public static final int DECHROMED_DESCRIPTION
Dechromed content mode - description field
- See Also:
- Constant Field Values
-
DECHROMED_CONTENT
public static final int DECHROMED_CONTENT
Dechromed content mode - content field
- See Also:
- Constant Field Values
-
CHROMED_USE
public static final int CHROMED_USE
Chromed suppression mode - use chromed content if dechromed content not available
- See Also:
- Constant Field Values
-
CHROMED_SKIP
public static final int CHROMED_SKIP
Chromed suppression mode - skip documents if dechromed content not available
- See Also:
- Constant Field Values
-
CHROMED_METADATA_ONLY
public static final int CHROMED_METADATA_ONLY
Chromed suppression mode - index metadata only if dechromed content not available
- See Also:
- Constant Field Values
-
robotsUsage
protected int robotsUsage
Robots usage flag
-
userAgent
protected java.lang.String userAgent
The user-agent for this connector instance
-
from
protected java.lang.String from
The email address for this connector instance
-
minimumMillisecondsPerFetchPerServer
protected long minimumMillisecondsPerFetchPerServer
The minimum milliseconds between fetches
-
maxOpenConnectionsPerServer
protected int maxOpenConnectionsPerServer
The maximum open connections
-
minimumMillisecondsPerBytePerServer
protected double minimumMillisecondsPerBytePerServer
The minimum milliseconds between bytes
-
throttleGroupName
protected java.lang.String throttleGroupName
The throttle group name
-
proxyHost
protected java.lang.String proxyHost
The proxy host
-
proxyPort
protected int proxyPort
The proxy port
-
proxyAuthDomain
protected java.lang.String proxyAuthDomain
Proxy auth domain
-
proxyAuthUsername
protected java.lang.String proxyAuthUsername
Proxy auth username
-
proxyAuthPassword
protected java.lang.String proxyAuthPassword
Proxy auth password
-
fetcher
protected ThrottledFetcher fetcher
The throttled fetcher used by this instance
-
robots
protected Robots robots
The robots object used by this instance
-
fetcherMap
protected static java.util.Map<java.lang.String,ThrottledFetcher> fetcherMap
Storage for fetcher objects
-
robotsMap
protected static java.util.Map robotsMap
Storage for robots objects
-
isInitialized
protected boolean isInitialized
Flag indicating whether session data is initialized
-
cache
protected static DataCache cache
-
understoodProtocols
protected static final java.util.Map understoodProtocols
-
ACTIVITY_FETCH
public static final java.lang.String ACTIVITY_FETCH
- See Also:
- Constant Field Values
-
ACTIVITY_ROBOTSPARSE
public static final java.lang.String ACTIVITY_ROBOTSPARSE
- See Also:
- Constant Field Values
-
ACTIVITY_PROCESS
public static final java.lang.String ACTIVITY_PROCESS
- See Also:
- Constant Field Values
-
xmlContentTypes
protected static java.util.Set<java.lang.String> xmlContentTypes
-
-
Method Detail
-
getSession
protected void getSession() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Establish a session.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
getActivitiesList
public java.lang.String[] getActivitiesList()
Return the list of activities that this connector supports (i.e. writes into the log).
- Specified by:
getActivitiesList in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getActivitiesList in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Returns:
- the list.
-
getConnectorModel
public int getConnectorModel()
Tell the world what model this connector uses for getDocumentIdentifiers(). This must return one of the MODEL_* values defined in IRepositoryConnector.
- Specified by:
getConnectorModel in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getConnectorModel in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Returns:
- the model type value.
-
connect
public void connect(org.apache.manifoldcf.core.interfaces.ConfigParams configParams)
Connect. The configuration parameters are included.
- Specified by:
connect in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
connect in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
configParams - are the configuration parameters for this connection. Note well: There are no exceptions allowed from this call, since it is expected to mainly establish connection parameters.
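One way to honor the no-exceptions rule is for connect() to store the parameters only, and to defer any parsing to a later session step guarded by a flag such as isInitialized. The following is a minimal sketch of that pattern, not the connector's actual code: the class LazySessionSketch, the use of a plain Map in place of ConfigParams, and the "proxyPort" key are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of deferred session setup; not ManifoldCF code.
final class LazySessionSketch {
    private Map<String, String> params = new HashMap<>();
    private boolean isInitialized = false;
    private int proxyPort = -1;

    // Stores a copy of the parameters and does no parsing, so a malformed
    // value cannot cause a failure at connect time.
    void connect(Map<String, String> configParams) {
        params = new HashMap<>(configParams);
        isInitialized = false;
    }

    // Parses the stored parameters on first use. A malformed port raises
    // NumberFormatException here rather than in connect().
    void getSession() {
        if (isInitialized) {
            return;
        }
        String port = params.get("proxyPort"); // assumed key name
        proxyPort = (port == null || port.isEmpty()) ? -1 : Integer.parseInt(port);
        isInitialized = true;
    }

    // Discards session data so that a later connect() starts clean.
    void disconnect() {
        isInitialized = false;
        proxyPort = -1;
        params = new HashMap<>();
    }

    int proxyPort() {
        getSession();
        return proxyPort;
    }

    boolean isInitialized() {
        return isInitialized;
    }
}
```

With this split, errors in the configuration surface from the methods that are allowed to throw, while connect() itself stays trivial.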
-
poll
public void poll() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
This method is periodically called for all connectors that are connected but not in active use.
- Specified by:
poll in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
poll in class org.apache.manifoldcf.core.connector.BaseConnector
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
check
public java.lang.String check() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check status of connection.
- Specified by:
check in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
check in class org.apache.manifoldcf.core.connector.BaseConnector
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
disconnect
public void disconnect() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Close the connection. Call this before discarding the repository connector.
- Specified by:
disconnect in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
disconnect in class org.apache.manifoldcf.core.connector.BaseConnector
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
getBinNames
public java.lang.String[] getBinNames(java.lang.String documentIdentifier)
Get the bin name string for a document identifier. The bin name describes the queue to which the document will be assigned for throttling purposes. Throttling controls the rate at which items in a given queue are fetched; it does not say anything about the overall fetch rate, which may operate on multiple queues or bins. For example, if you implement a web crawler, a good choice of bin name would be the server name, since that is likely to correspond to a real resource that will need real throttle protection.
- Specified by:
getBinNames in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getBinNames in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
documentIdentifier - is the document identifier.
- Returns:
- the bin name.
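The server-name suggestion above can be sketched as follows. This is a hedged illustration, not the connector's implementation: BinNameSketch and binNamesFor are hypothetical names, and the empty-string fallback bin for identifiers without a host is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of ManifoldCF: derives a throttling bin
// name from a URL-style document identifier by using its host name.
final class BinNameSketch {
    static String[] binNamesFor(String documentIdentifier) {
        try {
            String host = new URI(documentIdentifier).getHost();
            if (host == null) {
                // No host component: fall back to one shared bin (assumption).
                return new String[] {""};
            }
            // Lower-case the host so that differently cased spellings of
            // the same server share one bin.
            return new String[] {host.toLowerCase(Locale.ROOT)};
        } catch (URISyntaxException e) {
            // Unparseable identifier: same shared fallback bin (assumption).
            return new String[] {""};
        }
    }
}
```

Because every document from one host lands in the same bin, a per-bin fetch rate limit translates directly into a per-server limit.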
-
addSeedDocuments
public java.lang.String addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities, org.apache.manifoldcf.core.interfaces.Specification spec, java.lang.String lastSeedVersion, long seedTime, int jobMode) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents are seeded when this method calls appropriate methods in the passed in ISeedingActivity object.
This method can choose to find repository changes that happen only during the specified time interval. The seeds recorded by this method will be viewed by the framework based on what the getConnectorModel() method returns. It is not a big problem if the connector chooses to create more seeds than are strictly necessary; it is merely a question of overall work required.
The end time and seeding version string passed to this method may be interpreted for greatest efficiency. For continuous crawling jobs, this method will be called once, when the job starts, and at various periodic intervals as the job executes. When a job's specification is changed, the framework automatically resets the seeding version string to null. The seeding version string may also be set to null on each job run, depending on the connector model returned by getConnectorModel().
Note that it is always ok to send MORE documents rather than fewer to this method. The connector will be connected before this method can be called.
- Specified by:
addSeedDocuments in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
addSeedDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
seedTime - is the end of the time range of documents to consider, exclusive.
lastSeedVersion - is the last seeding version string for this job, or null if the job has no previous seeding version string.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
- Returns:
- an updated seeding version string, to be stored with the job.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
makeDocumentIdentifier
protected static java.lang.String makeDocumentIdentifier(RSSConnector.CanonicalizationPolicies policies, java.lang.String parentIdentifier, java.lang.String rawURL) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Convert an absolute or relative URL to a document identifier. This may involve several steps at some point, but right now it does NOT involve converting the host name to a canonical host name. (Doing so would destroy the ability of virtually hosted sites to do the right thing, since the original host name would be lost.) Thus, we do the conversion to IP address right before we actually fetch the document.
- Parameters:
policies - are the canonicalization policies in effect.
parentIdentifier - the identifier of the document in which the raw url was found, or null if none.
rawURL - is the raw, un-normalized and un-canonicalized url.
- Returns:
- the canonical URL (the document identifier), or null if the url was illegal.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
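The general shape of turning a raw, possibly relative URL into an identifier can be sketched with java.net.URI. This is only an illustration under stated assumptions: DocumentIdentifierSketch and resolve are hypothetical names, the canonicalization policies are omitted entirely, and the real method may accept or reject URLs differently. As in the description above, the host name is kept as written rather than replaced by a canonical name or address.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration, not the ManifoldCF implementation.
final class DocumentIdentifierSketch {
    // Resolves rawURL against parentIdentifier (when one is given) and
    // returns the resulting absolute URL, or null if the URL is illegal
    // or cannot be made absolute.
    static String resolve(String parentIdentifier, String rawURL) {
        try {
            URI url = new URI(rawURL);
            if (parentIdentifier != null) {
                url = new URI(parentIdentifier).resolve(url);
            }
            // A usable identifier needs a scheme and a host.
            if (!url.isAbsolute() || url.getHost() == null) {
                return null;
            }
            // normalize() removes "." and ".." path segments; it leaves
            // the host name untouched.
            return url.normalize().toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

For instance, a relative link such as "../items/1.html" found in "http://example.com/feeds/news.xml" would resolve against that parent, while a relative link with no parent yields null.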
-
doCanonicalization
protected static java.lang.String doCanonicalization(RSSConnector.CanonicalizationPolicy p, WebURL url) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.net.URISyntaxException
Code to canonicalize a URL. If the URL cannot be canonicalized (and is illegal), return null.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.net.URISyntaxException
-
processDocuments
public void processDocuments(java.lang.String[] documentIdentifiers, org.apache.manifoldcf.crawler.interfaces.IExistingVersions statuses, org.apache.manifoldcf.core.interfaces.Specification spec, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int jobMode, boolean usesDefaultAuthority) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Process a set of documents. This is the method that should cause each document to be fetched, processed, and the results either added to the queue of documents for the current job, and/or entered into the incremental ingestion manager. The document specification allows this class to filter what is done based on the job. The connector will be connected before this method can be called.
- Specified by:
processDocuments in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
processDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
documentIdentifiers - is the set of document identifiers to process.
statuses - are the currently-stored document versions for each document in the set of document identifiers passed in above.
activities - is the interface this method should use to queue up new document references and ingest documents.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
handleIOException
protected static void handleIOException(java.io.IOException e, java.lang.String context) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
outputConfigurationHeader
public void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.List<java.lang.String> tabsArray) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the configuration header section. This method is called in the head section of the connector's configuration page. Its purpose is to add the required tabs to the list, and to output any JavaScript methods that might be needed by the configuration editing HTML.
- Specified by:
outputConfigurationHeader in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
outputConfigurationHeader in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
outputConfigurationBody
public void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the configuration body section. This method is called in the body section of the connector's configuration page. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editconnection".
- Specified by:
outputConfigurationBody in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
outputConfigurationBody in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabName - is the current tab name.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
processConfigurationPost
public java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a configuration post. This method is called at the start of the connector's configuration page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the configuration parameters accordingly. The name of the posted form is "editconnection".
- Specified by:
processConfigurationPost in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
processConfigurationPost in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
variableContext - is the set of variables available from the post, including binary file post information.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
- Returns:
- null if all is well, or a string error message if there is an error that should prevent saving of the connection (and cause a redirection to an error page).
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
viewConfiguration
public void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
View configuration. This method is called in the body section of the connector's view configuration page. Its purpose is to present the connection information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.
- Specified by:
viewConfiguration in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
viewConfiguration in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
outputSpecificationHeader
public void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, java.util.List<java.lang.String> tabsArray) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the specification header section. This method is called in the head section of a job page which has selected a repository connection of the current type. Its purpose is to add the required tabs to the list, and to output any JavaScript methods that might be needed by the job editing HTML. The connector will be connected before this method can be called.
- Specified by:
outputSpecificationHeader in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
outputSpecificationHeader in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
outputSpecificationBody
public void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, int actualSequenceNumber, java.lang.String tabName) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the specification body section. This method is called in the body section of a job page which has selected a repository connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is always "editjob". The connector will be connected before this method can be called.
- Specified by:
outputSpecificationBody in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
outputSpecificationBody in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
actualSequenceNumber - is the connection within the job that has currently been selected.
tabName - is the current tab name. (actualSequenceNumber, tabName) form a unique tuple within the job.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
processSpecificationPost
public java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the document specification accordingly. The name of the posted form is always "editjob". The connector will be connected before this method can be called.
- Specified by:
processSpecificationPost in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
processSpecificationPost in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
variableContext - contains the post data, including binary file-upload information.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
- Returns:
- null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
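The return contract described above (null on success, an error message otherwise) can be illustrated with a minimal sketch. It does not use the ManifoldCF classes: a plain Map stands in for IPostParameters, the Spec class stands in for Specification, and the "s" + connectionSequenceNumber + "_" field prefix and the "matchregexp" field name are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SpecificationPostSketch {
  // Hypothetical stand-in for the document specification: a list of URL regexps.
  public static final class Spec {
    public final List<String> matchRegexps = new ArrayList<>();
  }

  // Returns null if all is well, or an error message that should block saving the job.
  public static String processPost(Map<String, String> posted, Spec spec, int connectionSequenceNumber) {
    // Assumed naming scheme: form fields are prefixed per connection.
    String prefix = "s" + connectionSequenceNumber + "_";
    String value = posted.get(prefix + "matchregexp");
    if (value == null) {
      return null; // nothing was posted for this connection; leave the specification alone
    }
    try {
      Pattern.compile(value); // validate before touching the specification
    } catch (PatternSyntaxException e) {
      return "Invalid regular expression: " + e.getDescription();
    }
    spec.matchRegexps.clear();
    spec.matchRegexps.add(value);
    return null;
  }
}
```

Validating before modifying the specification means a rejected post leaves the previously saved values intact.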
-
viewSpecification
public void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
View specification. This method is called in the body section of a job's view page. Its purpose is to present the document specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags. The connector will be connected before this method can be called.
- Specified by:
viewSpecification in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
viewSpecification in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
handleRSSFeedSAX
protected void handleRSSFeedSAX(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, RSSConnector.Filter filter) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Handle an RSS feed document, using SAX to limit the memory impact.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
-
getMaxDocumentRequest
public int getMaxDocumentRequest()
Get the maximum number of documents to amalgamate together into one batch, for this connector.
- Specified by:
getMaxDocumentRequest in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
- Overrides:
getMaxDocumentRequest in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
- Returns:
the maximum number. 0 indicates "unlimited".
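As a rough illustration of how a caller might honor this value, the sketch below splits a list of document identifiers into batches and treats 0 as "unlimited". The class and method names are hypothetical and are not part of ManifoldCF; the framework's own batching logic may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
  // Split identifiers into batches of at most maxDocumentRequest entries.
  // A value of 0 (or less) is treated as "unlimited": everything goes in one batch.
  public static List<List<String>> toBatches(List<String> ids, int maxDocumentRequest) {
    List<List<String>> batches = new ArrayList<>();
    if (ids.isEmpty()) {
      return batches;
    }
    int size = (maxDocumentRequest <= 0) ? ids.size() : maxDocumentRequest;
    for (int i = 0; i < ids.size(); i += size) {
      // Copy the sublist so each batch is independent of the input list.
      batches.add(new ArrayList<>(ids.subList(i, Math.min(i + size, ids.size()))));
    }
    return batches;
  }
}
```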
-
getFetcher
protected ThrottledFetcher getFetcher()
Given the current parameters, find the correct throttled fetcher object (or create one if none exists).
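The find-or-create behavior can be sketched generically as follows. This is not the connector's implementation: the Fetcher class is a hypothetical stand-in for ThrottledFetcher, and keying the lookup by a single parameter string is an assumption made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FetcherCacheSketch {
  // Hypothetical stand-in for ThrottledFetcher.
  public static final class Fetcher {
    public final String key;
    Fetcher(String key) {
      this.key = key;
    }
  }

  private final Map<String, Fetcher> fetchers = new ConcurrentHashMap<>();

  // Return the fetcher registered for these parameters, creating it on first use.
  public Fetcher getFetcher(String parameterKey) {
    return fetchers.computeIfAbsent(parameterKey, Fetcher::new);
  }
}
```

With computeIfAbsent, concurrent callers that pass the same key receive the same instance, so no extra locking is needed in this sketch.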
-
stringToArray
protected static java.util.List<java.lang.String> stringToArray(java.lang.String input)
Read a string as a sequence of individual expressions, urls, etc.
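The description does not say how the entries are delimited. The sketch below assumes one entry per line, with surrounding whitespace trimmed and with blank lines and lines starting with "#" skipped. These rules are assumptions rather than documented behavior, and the class name is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class StringToArraySketch {
  // Assumed format: one expression or URL per line; blank lines and "#" lines are ignored.
  public static List<String> stringToArray(String input) {
    List<String> list = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) {
          continue;
        }
        list.add(line);
      }
    } catch (IOException e) {
      // Not expected when reading from an in-memory string.
      throw new UncheckedIOException(e);
    }
    return list;
  }
}
```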
-
compileList
protected static void compileList(java.util.List<java.util.regex.Pattern> output, java.util.List<java.lang.String> input) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Compile all regexp entries in the passed-in list, and add them to the output list.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
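A minimal sketch of the compile-and-append pattern is shown below. So that it compiles without ManifoldCF on the classpath, it throws IllegalArgumentException where the real method declares ManifoldCFException. The class name and the error message wording are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CompileListSketch {
  // Compile each input string and append the resulting Pattern to output.
  // IllegalArgumentException stands in for ManifoldCFException here.
  public static void compileList(List<Pattern> output, List<String> input) {
    for (String regexp : input) {
      try {
        output.add(Pattern.compile(regexp));
      } catch (PatternSyntaxException e) {
        throw new IllegalArgumentException(
            "Bad regular expression '" + regexp + "': " + e.getDescription(), e);
      }
    }
  }

  public static void main(String[] args) {
    List<Pattern> patterns = new ArrayList<>();
    compileList(patterns, List.of("^https?://.*\\.xml$", "feed"));
    System.out.println(patterns.size() + " patterns compiled");
  }
}
```

Entries compiled before a failing one remain in the output list in this sketch. A caller that needs all-or-nothing behavior would compile into a temporary list first.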
-
getRobots
protected Robots getRobots(ThrottledFetcher fetcher)
Given the current parameters, find the correct robots object (or create one if none found).
-
-