This stage is primarily intended to split an archive file containing a list of items (documents, folders, etc.) and then process each entry individually, as a sub-job on its own pipeline.

This stage takes a job that contains a data stream. It assumes that the data stream represents an archive file and processes every entry in the file to create sub-job documents.

Archive Files Extractor
Factory Name	com.searchtechnologies.aspire:aspire-archive-file-extractor
subType	default
Inputs	Job containing a data stream (object['contentStream'], a stream to the archive file to process).
Outputs	One subDocument for each entry in the archive file, submitted as a subjob.

Configuration

Element	Type	Default	Description
indexContainers	Boolean	true	Indicates if folders are to be indexed.
scanRecursively	Boolean	true	Indicates if the subfolders are to be scanned.
parentInfo	Boolean	true	Indicates whether the parent (archive file) information is added to every entry of the file.
deleteFirst	Boolean	false	Indicates whether the delete by query is sent before or after the archive file is processed.
indexArchive	Boolean	false	Indicates whether the archive job is terminated so that no further stages run for this job. (All entries of the archive file are still processed.)
discoveryMethod	String	AI	Indicates the method used to recognize an archive file (AI - Auto Identify or RE - Regex).
checkList	String		Indicates the list of types used by the Auto Identify discovery method.
archivesRegex	String		Indicates the regular expression used by the Regex discovery method.

Example configuration

<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-archive-file-extractor">
   <indexContainers>${indexContainers}</indexContainers>
   <scanRecursively>${scanRecursively}</scanRecursively>
   <parentInfo>${parentInfo}</parentInfo>
   <deleteFirst>${deleteFirst}</deleteFirst>
   <terminate>${terminate}</terminate>
   <discoveryMethod>${discoveryMethod}</discoveryMethod>
   <mimetypeList>${mimetypeList}</mimetypeList>
   <archivesRegex>${archivesRegex}</archivesRegex>
   <debug>${debug}</debug>
   <branches>
      <branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="true"
         batchSize="${batchSize}" batchTimeout="${batchTimeout}" simultaneousBatches="${simultaneousBatches}" />

      <branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="true"
         batchSize="${batchSize}" batchTimeout="${batchTimeout}" simultaneousBatches="${simultaneousBatches}" />
   </branches>
</component>
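
The example above uses ${...} placeholders throughout. As a rough sketch of how the discovery settings might look with literal values, the fragment below switches the component to regex-based discovery. The values are illustrative assumptions, not taken from a real deployment: the regular expression is hypothetical, and it is assumed here that the pattern is matched against the name or URL of the file.

```xml
<!-- Illustrative values only; element names are those listed in the Configuration table. -->
<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-archive-file-extractor">
   <indexContainers>true</indexContainers>
   <scanRecursively>true</scanRecursively>
   <parentInfo>true</parentInfo>
   <deleteFirst>false</deleteFirst>
   <!-- RE selects regex-based discovery instead of the default AI (Auto Identify). -->
   <discoveryMethod>RE</discoveryMethod>
   <!-- Hypothetical pattern; assumes the regex is applied to the file name or URL. -->
   <archivesRegex>.*\.(zip|tar|jar)$</archivesRegex>
   <!-- The <branches> section stays as in the example above. -->
</component>
```

With discoveryMethod set to RE, the list of types for Auto Identify is presumably not needed, so it is left out of this sketch.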

Note that you need to configure two pipelines because of the branches for "AddUpdate" and "Delete" subjobs. Additionally, you can add an Extract Text component to get the content of every entry.

<component name="AddUpdatePM" subType="pipeline" factoryName="aspire-application">
   <debug>${debug}</debug>
   <gatherStatistics>${debug}</gatherStatistics>
   <pipelines>
      <pipeline name="addUpdatePipeline" default="true">
         <script>
            <![CDATA[
               .....
            ]]>
         </script>
      </pipeline>
   </pipelines>
</component>

<component name="DeletePM" subType="pipeline" factoryName="aspire-application">
   <debug>${debug}</debug>
   <gatherStatistics>${debug}</gatherStatistics>
   <pipelines>
      <pipeline name="deletePipeline" default="true">
         <script>
            <![CDATA[
               .....
            ]]>
         </script>
      </pipeline>
   </pipelines>
</component>
