2. The Extractor creates temporary files in the system's default temporary file location (for PST files that do not reside on the disk).
3. If the PST files are excluded from the crawl, the "Scan Excluded Items" option will not work for this kind of item.
4. At the moment, the "Delete by Query" functionality of the component only works with the Elasticsearch and Solr Publishers.
5. You must share the PST Extractor rule in a shared library in order to use the same rule in both the onAddUpdate and onDelete stages.
Element | Type | Default | Description |
---|---|---|---|
indexContainers | Boolean | true | Indicates if folders are to be indexed. |
scanRecursively | Boolean | true | Indicates if the subfolders are to be scanned. |
parentInfo | Boolean | true | Indicates if the parent (archive file) info will be added to every entry of the file. |
deleteFirst | Boolean | false | Indicates if the delete by query will be sent before or after processing the PST file. |
indexArchive | Boolean | false | Indicates if the PST (archive) job will be terminated so that no further stages are run for this job. (All the entries of the archive file will be processed.) |
discoveryMethod | String | AI | Indicates the method used to recognize a PST file (AI - Auto Identify or RE - Regex). |
archivesRegex | String | | Indicates the regular expression used by the discovery method. |
<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-pst-extractor">
  <indexContainers>${indexContainers}</indexContainers>
  <scanRecursively>${scanRecursively}</scanRecursively>
  <parentInfo>${parentInfo}</parentInfo>
  <deleteFirst>${deleteFirst}</deleteFirst>
  <indexArchive>${indexArchive}</indexArchive>
  <discoveryMethod>${discoveryMethod}</discoveryMethod>
  <checklist>${checklist}</checklist>
  <archivesRegex>${archivesRegex}</archivesRegex>
  <debug>${debug}</debug>
  <branches>
    <branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="false" />
    <branch event="onErrorSubJob" pipelineManager="ErrorPM" batching="false" />
    <branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="false" />
  </branches>
</component>
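As an illustration of how the elements in the table fit together, the following is a minimal sketch of the same component with the placeholders replaced by concrete values and discovery switched from the default AI to RE. The regular expression is a hypothetical example, and it assumes the pattern is matched against the item's name or URL, which is not stated here. The checklist element from the template is left out because its values are not described in the table.

```xml
<!-- Sketch only: the values are illustrative, not recommended settings. -->
<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-pst-extractor">
  <!-- Table defaults kept as-is -->
  <indexContainers>true</indexContainers>
  <scanRecursively>true</scanRecursively>
  <parentInfo>true</parentInfo>
  <deleteFirst>false</deleteFirst>
  <indexArchive>false</indexArchive>
  <!-- RE = recognize PST files by regular expression instead of Auto Identify (AI) -->
  <discoveryMethod>RE</discoveryMethod>
  <!-- Hypothetical pattern: anything ending in ".pst" (assumes matching on name or URL) -->
  <archivesRegex>.*\.pst$</archivesRegex>
  <debug>false</debug>
  <branches>
    <branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="false" />
    <branch event="onErrorSubJob" pipelineManager="ErrorPM" batching="false" />
    <branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="false" />
  </branches>
</component>
```

With discoveryMethod left at AI, the archivesRegex element would presumably be unused, since the table describes it as the expression used by the discovery method.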
You need to configure two pipelines, one for each of the "AddUpdate" and "Delete" subjob branches. Additionally, you can add Extract Text components in order to get the content of every entry.
<component name="AddUpdatePM" subType="pipeline" factoryName="aspire-application">
  <debug>${debug}</debug>
  <gatherStatistics>${debug}</gatherStatistics>
  <pipelines>
    <pipeline name="addUpdatePipeline" default="true">
      <script>
        <![CDATA[ ..... ]]>
      </script>
    </pipeline>
  </pipelines>
</component>
<component name="DeletePM" subType="pipeline" factoryName="aspire-application">
  <debug>${debug}</debug>
  <gatherStatistics>${debug}</gatherStatistics>
  <pipelines>
    <pipeline name="deletePipeline" default="true">
      <script>
        <![CDATA[ ..... ]]>
      </script>
    </pipeline>
  </pipelines>
</component>
Install the component as a workflow application:
5. Complete the configuration for:
6. Complete the Extract Text (*) and Routing information.
7. After you save the component, share it in a library (required).
8. Add it into the Delete pipeline from the shared library (required).
9. Save the connector.
(*) Note: To extract the content of the files inside the PST file, disable text extraction in the connector and configure it in the PST Extractor component instead. Add a rule to extract text for the other jobs from the crawl. (You can share the Extract Text rule in the same library used before.)