The Archive Extractor is primarily intended to split an archive file containing a list of items (documents, folders, etc.) and then process each individual record one at a time, as sub-jobs on their own pipeline.
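
The split-then-process idea can be illustrated with a minimal Python sketch. It is not the Aspire implementation: the `SubJob` class and the `split_archive` and `build_sample_zip` functions are hypothetical names introduced here, and only the ZIP case is shown, using the standard library.

```python
import io
import zipfile
from dataclasses import dataclass
from typing import Iterator


@dataclass
class SubJob:
    """Hypothetical stand-in for one sub-job; not an Aspire class."""
    parent: str     # name of the archive the entry came from
    path: str       # path of the entry inside the archive
    content: bytes  # raw bytes of the extracted entry


def split_archive(archive_name: str, data: bytes) -> Iterator[SubJob]:
    """Yield one SubJob per file entry found in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # explicit folder entries carry no content
            yield SubJob(archive_name, info.filename, archive.read(info))


def build_sample_zip() -> bytes:
    """Build a small in-memory ZIP so the sketch runs without any files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docs/a.txt", "first record")
        archive.writestr("docs/b.txt", "second record")
    return buffer.getvalue()


if __name__ == "__main__":
    # Each entry is handled on its own, one at a time.
    for job in split_archive("sample.zip", build_sample_zip()):
        print(job.parent, job.path, len(job.content))
```

Yielding entries lazily means each record can be handed to its own processing step before the next one is read.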


Aspire Python Bridge is a component that runs Python scripts from Aspire by passing the job's Aspire Object, via HTTP, to a server that runs the Python code and returns another Aspire Object to Aspire. The returned object then continues processing on the pipeline (unless an error occurs).




Note

The HTTP server is implemented in Flask, so Python and Flask must be installed in order to use the component.

Features

The component can extract and process the following archive types:

  • ZIP
  • AR
  • ARJ
  • CPIO
  • JAR
  • DUMP
  • TAR
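
As an illustration of how some of the listed types can be told apart, the sketch below checks well-known file signatures ("magic bytes"). This is an assumption-laden example, not the component's actual discovery logic: `detect_archive_type` is a hypothetical helper, DUMP is not covered, and JAR is reported as ZIP because both share the same signature.

```python
from typing import Optional


def detect_archive_type(header: bytes) -> Optional[str]:
    """Guess the archive type from the first bytes of a file.

    Pass at least the first 262 bytes so the TAR check can see its magic.
    Returns None when no known signature matches.
    """
    if header[:4] in (b"PK\x03\x04", b"PK\x05\x06"):
        return "ZIP"   # JAR files use the same container signature
    if header[:8] == b"!<arch>\n":
        return "AR"
    if header[:2] == b"\x60\xea":
        return "ARJ"
    if header[:6] in (b"070701", b"070702", b"070707"):
        return "CPIO"  # ASCII cpio variants only
    if header[257:262] == b"ustar":
        return "TAR"   # POSIX/GNU tar magic at offset 257
    return None


if __name__ == "__main__":
    print(detect_archive_type(b"!<arch>\n"))
    print(detect_archive_type(b"plain text"))
```

Old-style tar files without the `ustar` marker and binary cpio archives are not recognized by this check, so a real detector would need further fallbacks.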

Known Limitations

  • RAR is proprietary and is not included in this version.
  • 7z does not support stream opening, so it is excluded from this version.
  • Depending on the encoding of the archive files, folders might not be returned as archive entries, so they may not appear in the resulting jobs.
  • If the archive files are excluded from the crawl, the Scan Excluded Items option will not work for these items.
  • At the moment, the "Delete by Query" functionality of the component only works with the Elasticsearch Publisher.
  • The "Delete by Query" implementation for the rest of the available publishers is still pending.
  • Since JAR files are fundamentally archive files built on the ZIP file format and differ only in the .jar file extension, the auto discovery method will sometimes pick up ZIP files when JAR types are selected, and vice versa.
  • The Archive Extractor rule must be placed in a shared library in order to use the same rule on both the onAddUpdate and onDelete stages.
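
The JAR/ZIP ambiguity in the list above can be demonstrated with a short sketch. It shows that both files start with the same bytes, and that one possible heuristic is to look for a JAR manifest. This is only an illustration under that assumption: `looks_like_jar` and `make_zip` are hypothetical helpers, the manifest is optional in JAR files, and nothing here describes how the component's auto discovery works.

```python
import io
import zipfile
from typing import Iterable

MANIFEST = "META-INF/MANIFEST.MF"


def looks_like_jar(data: bytes) -> bool:
    """Heuristic: a ZIP container that includes a JAR manifest entry."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return MANIFEST in archive.namelist()


def make_zip(names: Iterable[str]) -> bytes:
    """Build an in-memory ZIP with empty entries under the given names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "")
    return buffer.getvalue()


if __name__ == "__main__":
    plain = make_zip(["readme.txt"])
    jar = make_zip([MANIFEST, "com/example/Main.class"])
    # Both containers begin with the same four signature bytes.
    print(plain[:4] == jar[:4])
    print(looks_like_jar(plain), looks_like_jar(jar))
```

Because the container bytes are identical, content-based detection alone cannot separate the two types reliably; a JAR without a manifest would still be classified as a plain ZIP by this check.
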
Note

Take into consideration that, because the Archive Extractor is a workflow application that runs in a pipeline after the connector, the files extracted by the component are not counted in the connector statistics.