Class HtmlExtractor
- java.lang.Object
  - org.apache.manifoldcf.core.connector.BaseConnector
    - org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
      - org.apache.manifoldcf.agents.transformation.htmlextractor.HtmlExtractor
- All Implemented Interfaces:
org.apache.manifoldcf.agents.interfaces.IPipelineConnector, org.apache.manifoldcf.agents.interfaces.ITransformationConnector, org.apache.manifoldcf.core.interfaces.IConnector
public class HtmlExtractor extends org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
-
-
Nested Class Summary
Nested Classes
- protected static interface HtmlExtractor.DestinationStorage
- protected static class HtmlExtractor.FileDestinationStorage
- protected static class HtmlExtractor.MemoryDestinationStorage
- protected static class HtmlExtractor.SpecPacker
-
Field Summary
Fields
- static java.lang.String _rcsid
- protected static java.lang.String[] activitiesList
- protected static java.lang.String ACTIVITY_PROCESS
- static java.lang.String ATTRIBUTE_SOURCE
- static java.lang.String ATTRIBUTE_TARGET
- static java.lang.String ATTRIBUTE_VALUE
- protected static int HTML_STRIP_ALL
- protected static int HTML_STRIP_NONE
- protected static int html_strip_usage
- protected static long inMemoryMaximumFile
  We handle up to 64K in memory; after that we go to disk.
- static java.lang.String NODE_FILTEREMPTY
- static java.lang.String NODE_KEEPMETADATA
-
Constructor Summary
Constructors
- HtmlExtractor()
-
Method Summary
Methods
- int addOrReplaceDocumentWithException(java.lang.String documentURI, org.apache.manifoldcf.core.interfaces.VersionContext pipelineDescription, org.apache.manifoldcf.agents.interfaces.RepositoryDocument document, java.lang.String authorityNameString, org.apache.manifoldcf.agents.interfaces.IOutputAddActivity activities)
  Add (or replace) a document in the output data store using the connector.
- protected static void fillInHtmlExtractorSpecification(java.util.Map<java.lang.String,java.lang.Object> paramMap, org.apache.manifoldcf.core.interfaces.Specification os)
- java.lang.String[] getActivitiesList()
  Return a list of activities that this connector generates.
- void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName)
  Output the configuration body section.
- void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.List<java.lang.String> tabsArray)
  Output the configuration header section.
- void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification os, int connectionSequenceNumber, int actualSequenceNumber, java.lang.String tabName)
  Output the specification body section.
- void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification os, int connectionSequenceNumber, java.util.List<java.lang.String> tabsArray)
  Output the specification header section.
- java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
  Process a configuration post.
- java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification os, int connectionSequenceNumber)
  Process a specification post.
- void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
  View configuration.
- void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification os, int connectionSequenceNumber)
  View specification.
-
Methods inherited from class org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
checkDateIndexable, checkDocumentIndexable, checkLengthIndexable, checkMimeTypeIndexable, checkURLIndexable, getFormCheckJavascriptMethodName, getFormPresaveCheckJavascriptMethodName, getPipelineDescription, requestInfo
-
Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector
check, clearThreadContext, connect, deinstall, disconnect, getConfiguration, install, isConnected, outputConfigurationBody, outputConfigurationHeader, outputConfigurationHeader, pack, packFixedList, packList, packList, poll, processConfigurationPost, setThreadContext, unpack, unpackFixedList, unpackList, viewConfiguration
-
Field Detail
-
_rcsid
public static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
ACTIVITY_PROCESS
protected static final java.lang.String ACTIVITY_PROCESS
- See Also:
- Constant Field Values
-
activitiesList
protected static final java.lang.String[] activitiesList
-
HTML_STRIP_NONE
protected static final int HTML_STRIP_NONE
- See Also:
- Constant Field Values
-
HTML_STRIP_ALL
protected static final int HTML_STRIP_ALL
- See Also:
- Constant Field Values
-
html_strip_usage
protected static int html_strip_usage
-
NODE_KEEPMETADATA
public static final java.lang.String NODE_KEEPMETADATA
- See Also:
- Constant Field Values
-
NODE_FILTEREMPTY
public static final java.lang.String NODE_FILTEREMPTY
- See Also:
- Constant Field Values
-
ATTRIBUTE_SOURCE
public static final java.lang.String ATTRIBUTE_SOURCE
- See Also:
- Constant Field Values
-
ATTRIBUTE_TARGET
public static final java.lang.String ATTRIBUTE_TARGET
- See Also:
- Constant Field Values
-
ATTRIBUTE_VALUE
public static final java.lang.String ATTRIBUTE_VALUE
- See Also:
- Constant Field Values
-
inMemoryMaximumFile
protected static final long inMemoryMaximumFile
We handle up to 64K in memory; after that we go to disk.
- See Also:
- Constant Field Values
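The 64K threshold above, together with the nested MemoryDestinationStorage and FileDestinationStorage classes, suggests a spill-to-disk pattern: small documents are buffered in memory and larger ones go to a temporary file. The following is a minimal sketch of that idea, not the connector's actual implementation. Every type and method name in it (SketchStorage, MemorySketchStorage, FileSketchStorage, StorageChooser) is hypothetical, and both the value 65536 and the inclusive comparison are assumptions read from the "up to 64K" comment.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical stand-in for a destination store; not the connector's own interface. */
interface SketchStorage {
  OutputStream getOutputStream();
  boolean isInMemory();
  void close() throws IOException;
}

/** Buffers the whole document on the heap. */
final class MemorySketchStorage implements SketchStorage {
  private final ByteArrayOutputStream buffer;

  MemorySketchStorage(int expectedSize) {
    buffer = new ByteArrayOutputStream(Math.max(expectedSize, 0));
  }

  @Override public OutputStream getOutputStream() { return buffer; }
  @Override public boolean isInMemory() { return true; }
  @Override public void close() { /* nothing to release */ }
}

/** Writes the document to a temporary file that is deleted on close. */
final class FileSketchStorage implements SketchStorage {
  private final File file;
  private final OutputStream out;

  FileSketchStorage() throws IOException {
    file = File.createTempFile("sketch", ".tmp");
    out = new FileOutputStream(file);
  }

  @Override public OutputStream getOutputStream() { return out; }
  @Override public boolean isInMemory() { return false; }

  @Override public void close() throws IOException {
    try {
      out.close();
    } finally {
      file.delete();
    }
  }
}

final class StorageChooser {
  /** Assumed value: 64K expressed in bytes. */
  static final long IN_MEMORY_MAXIMUM = 65536L;

  /** Assumption: the limit is inclusive; a negative (unknown) length never fits. */
  static boolean fitsInMemory(long length) {
    return length >= 0 && length <= IN_MEMORY_MAXIMUM;
  }

  /** Picks memory for small documents and a temporary file otherwise. */
  static SketchStorage forLength(long length) throws IOException {
    return fitsInMemory(length)
        ? new MemorySketchStorage((int) length)
        : new FileSketchStorage();
  }
}
```

A caller would obtain a store with StorageChooser.forLength(document length), write the transformed content to getOutputStream(), and call close() in a finally block so that any temporary file is removed.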
-
-
Method Detail
-
getActivitiesList
public java.lang.String[] getActivitiesList()
Return a list of activities that this connector generates. The connector does NOT need to be connected before this method is called.
- Specified by:
getActivitiesList in interface org.apache.manifoldcf.agents.interfaces.ITransformationConnector
- Overrides:
getActivitiesList in class org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
- Returns:
- the set of activities.
-
addOrReplaceDocumentWithException
public int addOrReplaceDocumentWithException(java.lang.String documentURI, org.apache.manifoldcf.core.interfaces.VersionContext pipelineDescription, org.apache.manifoldcf.agents.interfaces.RepositoryDocument document, java.lang.String authorityNameString, org.apache.manifoldcf.agents.interfaces.IOutputAddActivity activities) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption, java.io.IOException
Add (or replace) a document in the output data store using the connector. This method presumes that the connector object has been configured, and it is thus able to communicate with the output data store should that be necessary. The OutputSpecification is *not* provided to this method, because the goal is consistency, and if output is done it must be consistent with the output description, since that was what was partly used to determine if output should be taking place. So it may be necessary for this method to decode an output description string in order to determine what should be done.
- Specified by:
addOrReplaceDocumentWithException in interface org.apache.manifoldcf.agents.interfaces.IPipelineConnector
- Overrides:
addOrReplaceDocumentWithException in class org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
- Parameters:
documentURI - is the URI of the document. The URI is presumed to be the unique identifier which the output data store will use to process and serve the document. This URI is constructed by the repository connector which fetches the document, and is thus universal across all output connectors.
pipelineDescription - is the description string that was constructed for this document by the getOutputDescription() method.
document - is the document data to be processed (handed to the output data store).
authorityNameString - is the name of the authority responsible for authorizing any access tokens passed in with the repository document. May be null.
activities - is the handle to an object that the implementer of a pipeline connector may use to perform operations, such as logging processing activity, or sending a modified document to the next stage in the pipeline.
- Returns:
- the document status (accepted or permanently rejected).
- Throws:
java.io.IOException - only if there's a stream error reading the document data.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
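Because the method receives only a description string rather than the specification itself, an implementation has to encode its settings into that string and decode them again at processing time. The sketch below illustrates such a round trip under stated assumptions: SketchSettings is a hypothetical class, its two flags merely echo the NODE_KEEPMETADATA and NODE_FILTEREMPTY field names, and the two-character format is invented for illustration. It does not reproduce the format HtmlExtractor actually uses.

```java
/** Hypothetical settings holder; the encoding below is made up for illustration. */
final class SketchSettings {
  final boolean keepMetadata;
  final boolean filterEmpty;

  SketchSettings(boolean keepMetadata, boolean filterEmpty) {
    this.keepMetadata = keepMetadata;
    this.filterEmpty = filterEmpty;
  }

  /** Encodes each flag as '+' (on) or '-' (off), in a fixed order. */
  String toDescription() {
    return new StringBuilder(2)
        .append(keepMetadata ? '+' : '-')
        .append(filterEmpty ? '+' : '-')
        .toString();
  }

  /** Decodes a string produced by toDescription(); rejects anything else. */
  static SketchSettings fromDescription(String description) {
    if (description == null || description.length() != 2) {
      throw new IllegalArgumentException("Unexpected description: " + description);
    }
    return new SketchSettings(
        decodeFlag(description.charAt(0)),
        decodeFlag(description.charAt(1)));
  }

  private static boolean decodeFlag(char c) {
    if (c == '+') {
      return true;
    }
    if (c == '-') {
      return false;
    }
    throw new IllegalArgumentException("Unexpected flag character: " + c);
  }
}
```

Keeping the encoding deterministic matters here: two equal specifications should yield the same description string, so that a change in the string reliably signals a change in how the document must be processed.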
-
outputConfigurationHeader
public void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.List<java.lang.String> tabsArray) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the configuration header section. This method is called in the head section of the connector's configuration page. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the configuration editing HTML.
- Specified by:
outputConfigurationHeader in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
outputConfigurationHeader in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
outputConfigurationBody
public void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the configuration body section. This method is called in the body section of the connector's configuration page. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editconnection".
- Specified by:
outputConfigurationBody in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
outputConfigurationBody in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabName - is the current tab name.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
processConfigurationPost
public java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a configuration post. This method is called at the start of the connector's configuration page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the configuration parameters accordingly. The name of the posted form is "editconnection".
- Specified by:
processConfigurationPost in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
processConfigurationPost in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
variableContext - is the set of variables available from the post, including binary file post information.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
- Returns:
- null if all is well, or a string error message if there is an error that should prevent saving of the connection (and cause a redirection to an error page).
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
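The contract described above (gather posted values, update the parameters, return null on success or an error message otherwise) can be sketched without the ManifoldCF types. In the example below, plain maps stand in for IPostParameters and ConfigParams, and the "maxlength" field is a hypothetical parameter invented for illustration; nothing here is taken from HtmlExtractor's own form handling.

```java
import java.util.Map;

/** Hypothetical sketch of the "null means success" post-processing contract. */
final class ConfigurationPostSketch {
  /** Made-up form field and parameter name. */
  static final String FIELD_MAX_LENGTH = "maxlength";

  /**
   * Copies a validated value from the posted variables into the parameters.
   * Returns null if all is well, or an error message that should block saving.
   */
  static String processPost(Map<String, String> postedVariables,
                            Map<String, String> parameters) {
    String posted = postedVariables.get(FIELD_MAX_LENGTH);
    if (posted == null) {
      // Nothing was posted for this field, so the parameters stay untouched.
      return null;
    }
    try {
      long value = Long.parseLong(posted.trim());
      if (value < 0) {
        return "Maximum length must not be negative";
      }
      parameters.put(FIELD_MAX_LENGTH, Long.toString(value));
      return null;
    } catch (NumberFormatException e) {
      return "Maximum length must be a whole number: " + posted;
    }
  }
}
```

Note that an invalid post leaves the existing parameter value in place, so a rejected form never half-updates the configuration.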
-
viewConfiguration
public void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
View configuration. This method is called in the body section of the connector's view configuration page. Its purpose is to present the connection information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.
- Specified by:
viewConfiguration in interface org.apache.manifoldcf.core.interfaces.IConnector
- Overrides:
viewConfiguration in class org.apache.manifoldcf.core.connector.BaseConnector
- Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
fillInHtmlExtractorSpecification
protected static void fillInHtmlExtractorSpecification(java.util.Map<java.lang.String,java.lang.Object> paramMap, org.apache.manifoldcf.core.interfaces.Specification os)
-
outputSpecificationHeader
public void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification os, int connectionSequenceNumber, java.util.List<java.lang.String> tabsArray) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the specification header section. This method is called in the head section of a job page which has selected a pipeline connection of the current type. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the job editing HTML.
- Specified by:
outputSpecificationHeader in interface org.apache.manifoldcf.agents.interfaces.IPipelineConnector
- Overrides:
outputSpecificationHeader in class org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale -
os - is the current pipeline specification for this connection.
connectionSequenceNumber - is the unique number of this connection within the job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
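The two duties named above, registering tabs and emitting javascript, can be sketched as follows. This is an illustration only: a StringBuilder stands in for IHTTPOutput, the tab name and the javascript function name are hypothetical, and prefixing names with "s" plus the sequence number is an assumed way of keeping several connections in one job from colliding, not a scheme documented on this page.

```java
import java.util.List;

/** Hypothetical sketch of a specification header; no ManifoldCF types are used. */
final class SpecificationHeaderSketch {
  /** Made-up tab name. */
  static final String TAB_NAME = "Sketch Tab";

  /** Assumed naming scheme: one prefix per connection within the job. */
  static String prefix(int connectionSequenceNumber) {
    return "s" + connectionSequenceNumber + "_";
  }

  /** Registers the connector's tab and emits a namespaced javascript check function. */
  static void outputHeader(StringBuilder out, int connectionSequenceNumber,
                           List<String> tabsArray) {
    tabsArray.add(TAB_NAME);
    String functionName = prefix(connectionSequenceNumber) + "checkSpecification";
    out.append("<script type=\"text/javascript\">\n")
       .append("function ").append(functionName).append("()\n")
       .append("{\n")
       .append("  return true;\n")
       .append("}\n")
       .append("</script>\n");
  }
}
```

With a sequence number of 3, the emitted function would be named s3_checkSpecification, so a second connection of the same type in the same job would get a distinct function.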
-
outputSpecificationBody
public void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification os, int connectionSequenceNumber, int actualSequenceNumber, java.lang.String tabName) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Output the specification body section. This method is called in the body section of a job page which has selected a pipeline connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editjob".
- Specified by:
outputSpecificationBody in interface org.apache.manifoldcf.agents.interfaces.IPipelineConnector
- Overrides:
outputSpecificationBody in class org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale - is the preferred locale of the output.
os - is the current pipeline specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
actualSequenceNumber - is the connection within the job that has currently been selected.
tabName - is the current tab name.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
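One plausible reading of the connectionSequenceNumber, actualSequenceNumber, and tabName parameters is that a connector shows editable form elements only when its own tab on its own connection is selected, and otherwise carries the current values along as hidden inputs so they survive the next post. The sketch below illustrates that reading. It is an assumption, not behavior documented on this page, and the class, tab name, and field naming scheme are all hypothetical.

```java
/** Hypothetical sketch: render a field as editable or hidden depending on selection. */
final class SpecificationBodySketch {
  /** Made-up tab name. */
  static final String TAB_NAME = "Sketch Tab";

  /** Returns an input element: visible on the selected tab, hidden everywhere else. */
  static String renderField(int connectionSequenceNumber, int actualSequenceNumber,
                            String tabName, String fieldName, String value) {
    // Assumed naming scheme: prefix the field with this connection's number.
    String name = "s" + connectionSequenceNumber + "_" + fieldName;
    boolean selected = TAB_NAME.equals(tabName)
        && connectionSequenceNumber == actualSequenceNumber;
    String type = selected ? "text" : "hidden";
    return "<input type=\"" + type + "\" name=\"" + escape(name)
        + "\" value=\"" + escape(value) + "\"/>";
  }

  /** Minimal attribute escaping; '&' must be replaced first. */
  static String escape(String s) {
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;");
  }
}
```

Emitting the hidden variant on every other tab keeps the posted form complete, so switching tabs does not discard values the user has already entered.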
-
processSpecificationPost
public java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification os, int connectionSequenceNumber) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the transformation specification accordingly. The name of the posted form is "editjob".
- Specified by:
processSpecificationPost in interface org.apache.manifoldcf.agents.interfaces.IPipelineConnector
- Overrides:
processSpecificationPost in class org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
- Parameters:
variableContext - contains the post data, including binary file-upload information.
locale - is the preferred locale of the output.
os - is the current pipeline specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
- Returns:
- null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
viewSpecification
public void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification os, int connectionSequenceNumber) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
View specification. This method is called in the body section of a job's view page. Its purpose is to present the pipeline specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.
- Specified by:
viewSpecification in interface org.apache.manifoldcf.agents.interfaces.IPipelineConnector
- Overrides:
viewSpecification in class org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
- Parameters:
out - is the output to which any HTML should be sent.
locale - is the preferred locale of the output.
connectionSequenceNumber - is the unique number of this connection within the job.
os - is the current pipeline specification for this job.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
-