This stage is primarily intended to split an archive file containing a list of items (documents, folders, etc.) and then process each individual record one at a time, as sub-jobs on their own pipeline.

This stage takes a job which contains a data stream. It assumes that the data stream represents an archive file and processes every entry in the file to create sub-job documents.



Archive Files Extractor
Factory Name: com.accenture.aspire:aspire-archive-file-extractor
subType: default
Inputs: Job containing a data stream (object['contentStream'], which is a stream to the archive file to process).
Outputs: One subdocument for each entry in the archive file, submitted as a subjob.
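
Each archive entry becomes its own sub-job document on the onAddSubJob (or onDeleteSubJob) branch. Purely for illustration, such a document could look roughly like the sketch below; the field names (id, fetchUrl, parent) are hypothetical and are not the component's documented schema.

   <doc>
      <!-- Hypothetical sub-job document for a single archive entry (field names illustrative only) -->
      <id>reports.zip!/2020/summary.txt</id>
      <fetchUrl>reports.zip!/2020/summary.txt</fetchUrl>
      <!-- Parent (archive file) info, attached when parentInfo is true -->
      <parent>
         <id>reports.zip</id>
      </parent>
   </doc>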

Supported Files

This stage can extract and process the following archive types:

  • ZIP
  • AR
  • ARJ
  • CPIO
  • JAR
  • DUMP
  • TAR

Known Limitations

  • RAR uses a proprietary algorithm and is not included in this version.
  • 7z does not support stream opening, so it was excluded from this version.
  • Depending on the encoding of the archive file, folders are sometimes not returned as archive entries, so they may not appear in the resulting jobs.
  • If archive files are excluded from the crawl, the Scan Excluded Items option will not work for these items.
  • At the moment, the "Delete by Query" functionality of the component only works with the Elasticsearch Publisher.
  • The "Delete by Query" implementation for the rest of the available publishers is still pending.
  • Since JAR files are fundamentally archive files, built on the ZIP file format with the .jar file extension, the auto-discovery method sometimes picks up ZIP files when only the JAR type is selected, and vice versa.
  • The Archive Extractor rule must be shared into a library in order to use the same rule on both the onAddUpdate and onDelete stages.
Configuration

  • indexContainers (Boolean, default: true): Indicates whether folders are to be indexed.
  • scanRecursively (Boolean, default: true): Indicates whether subfolders are to be scanned.
  • parentInfo (Boolean, default: true): Indicates whether the parent (archive file) info will be added to every entry of the file.
  • deleteFirst (Boolean, default: false): Indicates whether the delete by query will be sent before or after processing the archive file.
  • indexArchive (Boolean, default: false): Indicates whether the archive job will be terminated so that no further stages run for it (all the entries of the archive file will still be processed).
  • discoveryMethod (String, default: AI): Indicates the method used to recognize an archive file (AI - Auto Identify, or RE - Regex).
  • checkList (String): Indicates the list of types used by the auto-discovery method.
  • archivesRegex (String): Indicates the regular expression used by the regex discovery method.

Example configuration

    <component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-archive-file-extractor">
       <indexContainers>${indexContainers}</indexContainers>
       <scanRecursively>${scanRecursively}</scanRecursively>
       <parentInfo>${parentInfo}</parentInfo>
       <deleteFirst>${deleteFirst}</deleteFirst>
       <terminate>${terminate}</terminate>
       <discoveryMethod>${discoveryMethod}</discoveryMethod>
       <mimetypeList>${mimetypeList}</mimetypeList>
       <archivesRegex>${archivesRegex}</archivesRegex>
       <debug>${debug}</debug>
       <branches>
          <branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="true"
    	batchSize="${batchSize}" batchTimeout="${batchTimeout}" simultaneousBatches="${simultaneousBatches}" />
              
          <branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="true"
    	batchSize="${batchSize}" batchTimeout="${batchTimeout}" simultaneousBatches="${simultaneousBatches}" />
       </branches>
    </component>
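
For reference, a filled-in variant of the configuration above might look like the following. The concrete values (regex, batch sizes, flags) are illustrative assumptions only and should be adjusted to your content source.

   <component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-archive-file-extractor">
      <indexContainers>true</indexContainers>
      <scanRecursively>true</scanRecursively>
      <parentInfo>true</parentInfo>
      <deleteFirst>false</deleteFirst>
      <terminate>false</terminate>
      <!-- RE: recognize archives by file name; AI (Auto Identify) would rely on the type list instead -->
      <discoveryMethod>RE</discoveryMethod>
      <archivesRegex>.*\.(zip|jar|tar|cpio|ar|arj|dump)$</archivesRegex>
      <debug>false</debug>
      <branches>
         <branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="true"
                 batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
         <branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="true"
                 batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
      </branches>
   </component>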
    
    
Note that you need to configure two pipelines for the "AddUpdate" and "Delete" subjob branches. Additionally, you can add an Extract Text component in order to get the content of every entry (see the sketch after the pipeline definitions below).
    
    <component name="AddUpdatePM" subType="pipeline" factoryName="aspire-application">
       <debug>${debug}</debug>
       <gatherStatistics>${debug}</gatherStatistics>
       <pipelines>
          <pipeline name="addUpdatePipeline" default="true">
             <script>
                <![CDATA[
                   .....
                ]]>
             </script>
          </pipeline>
       </pipelines>
    </component>
    
    <component name="DeletePM" subType="pipeline" factoryName="aspire-application">
       <debug>${debug}</debug>
       <gatherStatistics>${debug}</gatherStatistics>
       <pipelines>
          <pipeline name="deletePipeline" default="true">
             <script>
                <![CDATA[
                   .....
                ]]>
             </script>
          </pipeline>
       </pipelines>
    </component>
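
As noted above, text extraction for the archive entries can be handled inside the AddUpdate pipeline. The sketch below is a minimal, illustrative way to do that; it assumes a stage-based pipeline definition and assumes the Extract Text component is available under the factory name aspire-extract-text, so verify both against your Aspire distribution.

   <component name="ExtractText" subType="default" factoryName="aspire-extract-text"/>

   <component name="AddUpdatePM" subType="pipeline" factoryName="aspire-application">
      <debug>${debug}</debug>
      <pipelines>
         <pipeline name="addUpdatePipeline" default="true">
            <stages>
               <!-- Extract the text of each archive entry before any further processing -->
               <stage component="ExtractText" />
            </stages>
         </pipeline>
      </pipelines>
   </component>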

Install as a Workflow Application

In order to install the component as a workflow application, you need to:

1. Install and configure a Content Source for the crawl.
2. Go to the AddUpdate pipeline.
3. Go to the workflow section and add a "Custom Application Function".
4. Add the app-archive-extractor component.
5. Configure it:
   1. General
   2. Discovery Method
   3. Batching
   4. Extract Text (*)
   5. Routing
6. Once you save the component, share it in a library (this is required).
7. Add it into the Delete pipeline (from the shared library; this is also required).
8. Save the connector.

(*) Note: In order to extract the content of the files inside the archive file, you need to disable the connector's text extraction and configure it in the Archive File component instead. You then need to add a rule to extract text for the other jobs from the crawl (you can share the Extract Text rule in the same library used before).
