This tutorial walks through the steps necessary to use the PST Extractor with Aspire. This stage is primarily intended to split an Outlook PST file containing a list of items (folders, emails, attachments, appointments, etc.) and then to process each individual record one at a time, as sub-jobs on their own pipeline.
This stage takes a job which contains a data stream. It assumes that the data stream represents a PST file and then processes every entry in the file to create sub-job documents.
Element | Type | Default | Description |
---|---|---|---|
indexContainers | Boolean | true | Indicates whether folders are indexed. |
scanRecursively | Boolean | true | Indicates whether subfolders are scanned. |
parentInfo | Boolean | true | Indicates whether the parent (archive file) information is added to every entry of the file. |
deleteFirst | Boolean | false | Indicates whether the delete by query is sent before or after the PST file is processed. |
indexArchive | Boolean | false | Indicates whether the PST (archive) job is terminated so that no further stages are run for this job. (All the entries of the archive file will still be processed.) |
discoveryMethod | String | AI | Indicates the method used to recognize a PST file (AI - Auto Identify, or RE - Regex). |
archivesRegex | String | | Indicates the regular expression used by the discovery method. |
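As an illustration of the parameters above, the following is a minimal sketch of a component that uses regex-based discovery instead of auto-identification. The values are hypothetical: in particular, the pattern `.*\.pst$` assumes that the regular expression is matched against the file name or URL of the job and that Java-style regex syntax applies, neither of which is confirmed here. The branches are omitted for brevity and are still required, as in the full component configuration.

```xml
<!-- Sketch only: values below are illustrative assumptions, not recommended settings -->
<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-pst-extractor">
  <!-- Defaults from the table above, written out explicitly -->
  <indexContainers>true</indexContainers>
  <scanRecursively>true</scanRecursively>
  <parentInfo>true</parentInfo>
  <deleteFirst>false</deleteFirst>
  <indexArchive>false</indexArchive>
  <!-- RE selects regex discovery; AI (the default) would auto-identify PST files -->
  <discoveryMethod>RE</discoveryMethod>
  <!-- Hypothetical pattern: assumes matching against a name or URL ending in ".pst" -->
  <archivesRegex>.*\.pst$</archivesRegex>
  <!-- <branches> ... </branches> omitted here; see the full component configuration -->
</component>
```

If `discoveryMethod` is left as `AI`, the `archivesRegex` element would presumably be unused, which matches its empty default in the table.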
<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-pst-extractor">
  <indexContainers>${indexContainers}</indexContainers>
  <scanRecursively>${scanRecursively}</scanRecursively>
  <parentInfo>${parentInfo}</parentInfo>
  <deleteFirst>${deleteFirst}</deleteFirst>
  <indexArchive>${indexArchive}</indexArchive>
  <discoveryMethod>${discoveryMethod}</discoveryMethod>
  <checklist>${checklist}</checklist>
  <archivesRegex>${archivesRegex}</archivesRegex>
  <debug>${debug}</debug>
  <branches>
    <branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="false" />
    <branch event="onErrorSubJob" pipelineManager="ErrorPM" batching="false" />
    <branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="false" />
  </branches>
</component>
Note that you need to configure two pipelines, one for each of the "AddUpdate" and "Delete" sub-job branches. Additionally, you can add an Extract Text component in order to get the content of every entry.
<component name="AddUpdatePM" subType="pipeline" factoryName="aspire-application">
  <debug>${debug}</debug>
  <gatherStatistics>${debug}</gatherStatistics>
  <pipelines>
    <pipeline name="addUpdatePipeline" default="true">
      <script>
        <![CDATA[ ..... ]]>
      </script>
    </pipeline>
  </pipelines>
</component>
<component name="DeletePM" subType="pipeline" factoryName="aspire-application">
  <debug>${debug}</debug>
  <gatherStatistics>${debug}</gatherStatistics>
  <pipelines>
    <pipeline name="deletePipeline" default="true">
      <script>
        <![CDATA[ ..... ]]>
      </script>
    </pipeline>
  </pipelines>
</component>
In order to install the component as a workflow application, you need:
(*) Note: To extract the content of the files inside the PST file, you need to disable text extraction in the connector and configure it in the PST Extractor component instead. You then need to add a rule that performs text extraction for the other jobs from the crawl (you can share the Extract Text component from the same library used before).