The Archive Extractor is primarily intended to split an archive file containing a list of items (documents, folders, etc.) and then process each individual record one at a time, as sub-jobs on their own pipeline.
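
The splitting behaviour described above can be pictured with a minimal sketch. It is not the component's implementation: it uses Python's standard zipfile module on a ZIP archive only, and the names split_archive, process_entry and run_sub_jobs are hypothetical stand-ins for the extractor and the sub-job pipeline.

```python
import io
import zipfile
from typing import Callable, Iterator, List, Tuple


def split_archive(data: bytes) -> Iterator[Tuple[str, bool, bytes]]:
    """Yield (name, is_folder, content) for each entry stored in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            # Folder entries, when present, have no content of their own.
            content = b"" if info.is_dir() else archive.read(info)
            yield info.filename, info.is_dir(), content


def process_entry(name: str, is_folder: bool, content: bytes) -> dict:
    """Hypothetical sub-job: it only records the entry's name, kind and size."""
    return {"name": name, "is_folder": is_folder, "size": len(content)}


def run_sub_jobs(
    data: bytes,
    handler: Callable[[str, bool, bytes], dict] = process_entry,
) -> List[dict]:
    """Hand every entry to the handler one at a time, in archive order."""
    return [handler(*entry) for entry in split_archive(data)]


def build_sample_archive() -> bytes:
    """Build a small in-memory ZIP so the sketch is self-contained."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("report.txt", "quarterly report")
        archive.writestr("docs/readme.txt", "hello")
    return buffer.getvalue()


if __name__ == "__main__":
    for job in run_sub_jobs(build_sample_archive()):
        print(job)
```

Each dictionary returned by run_sub_jobs corresponds to one record that, in the real component, would travel through its own pipeline as a sub-job.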

Features

The component can extract and process the following file types:

ZIP
AR
ARJ
CPIO
JAR
DUMP
TAR

Known Limitations

RAR uses a proprietary algorithm and is not included in this version.
7z does not support stream opening, so it is excluded from this version.
Depending on the encoding of the archive file, folders are sometimes not returned as archive entries, so they may not appear in the resulting jobs.
If archive files are excluded from the crawl, the Scan Excluded Items option will not work for them.
At the moment, the "Delete by Query" functionality of the component only works with the Elasticsearch Publisher.
The "Delete by Query" implementation for the other available publishers is still pending.
Because JAR files are fundamentally archive files built on the ZIP file format with the .jar file extension, the auto-discovery method sometimes picks up ZIP files when only JAR types are selected, and vice versa.
The Archive Extractor rule must be placed in a shared library in order to use the same rule on both the onAddUpdate and onDelete stages.
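
Two of the limitations above can be reproduced with Python's standard zipfile module. This is an illustrative sketch, not the component's detection logic. It shows that a ZIP and a JAR-like archive begin with the same signature bytes, and that a ZIP-format archive can hold files under a folder path without a separate entry for the folder. Whether this is the cause of missing folders in a given crawl is an assumption. The helper names and the manifest check are hypothetical, and a manifest is optional in a real JAR.

```python
import io
import zipfile

ZIP_SIGNATURE = b"PK\x03\x04"  # local file header magic used by ZIP and JAR alike


def build_archive(entries: dict) -> bytes:
    """Create an in-memory ZIP-format archive from {name: text} pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in entries.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def entry_names(data: bytes) -> list:
    """Return the entry names exactly as the archive lists them."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def looks_like_jar(data: bytes) -> bool:
    """Heuristic only: many JARs carry META-INF/MANIFEST.MF, but not all do."""
    return "META-INF/MANIFEST.MF" in entry_names(data)


plain_zip = build_archive({"docs/a.txt": "alpha"})
jar_like = build_archive(
    {"META-INF/MANIFEST.MF": "Manifest-Version: 1.0\n", "docs/a.txt": "alpha"}
)

# Both archives start with the same bytes, so signature sniffing cannot separate them.
same_signature = plain_zip[:4] == jar_like[:4] == ZIP_SIGNATURE

# "docs/" was never written as its own entry, so it is absent from the listing.
folder_listed = "docs/" in entry_names(plain_zip)
```

Only a content-based heuristic such as looks_like_jar tells the two archives apart here, which is consistent with auto-discovery occasionally mixing up the two types.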
