This tutorial walks through the steps needed to use the PST Extractor on Aspire.

This stage is primarily intended to:

  1. Split an Outlook PST file containing a list of items: folders, emails, attachments, appointments, etc.
  2. Process each record one at a time, as a sub-job, on its own pipeline.
Tip

This stage takes a job that contains a data stream, assumes that the data stream represents a PST file, and then processes each entry in the file to create sub-job documents.


Known Limitations


  1. The PST Extractor does not take the security of a PST file (password protection) into account.
    • It can process a secured PST file as a normal file.
  2. The Extractor creates temporary files in the default temporary file location of the system (for PST files that do not reside on the disk).
    • That is, the path "C:\Users\username\AppData\Local\Temp" on Windows.
  3. If the PST files are excluded from the crawl, the "Scan Excluded Items" option will not work for this kind of item.
  4. At the moment, the "Delete by Query" functionality of the component only works with the Elasticsearch and Solr Publishers.
    • The "Delete by Query" implementation in the rest of the available publishers is still pending.
  5. You must share the PST Extractor rule in a shared library in order to use the same rule on both the onAddUpdate and onDelete stages.

Configuration


| Element | Type | Default | Description |
|---|---|---|---|
| indexContainers | Boolean | true | Indicates whether folders are indexed. |
| scanRecursively | Boolean | true | Indicates whether subfolders are scanned. |
| parentInfo | Boolean | true | Indicates whether the parent (archive file) info is added to every entry of the file. |
| deleteFirst | Boolean | false | Indicates whether the delete by query is sent before or after processing the PST file. |
| indexArchive | Boolean | false | Indicates whether the PST (archive) job is terminated so that no further stages run for this job. (All the entries of the archive file will be processed.) |
| discoveryMethod | String | AI | Indicates the method used to recognize a PST file (AI - Auto Identify or RE - Regex). |
| archivesRegex | String | | Indicates the regular expression used by the discovery method. |

Example Configuration

	<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-pst-extractor">
		<indexContainers>${indexContainers}</indexContainers>
		<scanRecursively>${scanRecursively}</scanRecursively>
		<parentInfo>${parentInfo}</parentInfo>
		<deleteFirst>${deleteFirst}</deleteFirst>
		<indexArchive>${indexArchive}</indexArchive>
		<discoveryMethod>${discoveryMethod}</discoveryMethod>
		<checklist>${checklist}</checklist>
		<archivesRegex>${archivesRegex}</archivesRegex>
		<debug>${debug}</debug>
		<branches>
			<branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="false" />
			<branch event="onErrorSubJob" pipelineManager="ErrorPM" batching="false" />
			<branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="false" />
		</branches>
	</component>
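
The example above uses ${...} placeholders for every setting. As a rough sketch of what the component might look like with literal values, the fragment below fills in the defaults from the Configuration table and switches discovery to a regular expression. The regex value is an assumption chosen only for illustration, and the "checklist" and "debug" elements are left out because this page does not describe them.

```xml
<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-pst-extractor">
	<!-- Defaults taken from the Configuration table -->
	<indexContainers>true</indexContainers>
	<scanRecursively>true</scanRecursively>
	<parentInfo>true</parentInfo>
	<deleteFirst>false</deleteFirst>
	<indexArchive>false</indexArchive>
	<!-- RE = recognize PST files by regular expression instead of auto-identification -->
	<discoveryMethod>RE</discoveryMethod>
	<!-- Hypothetical pattern: treat anything ending in .pst as a PST file -->
	<archivesRegex>.*\.pst</archivesRegex>
	<branches>
		<branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="false" />
		<branch event="onErrorSubJob" pipelineManager="ErrorPM" batching="false" />
		<branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="false" />
	</branches>
</component>
```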
Note

You need to configure two pipelines because of the branches for the "AddUpdate" and "Delete" sub-jobs. Additionally, you can add Extract Text components in order to get the content of every entry.


<component name="AddUpdatePM" subType="pipeline" factoryName="aspire-application">
   <debug>${debug}</debug>
   <gatherStatistics>${debug}</gatherStatistics>
   <pipelines>
      <pipeline name="addUpdatePipeline" default="true">
         <script>
            <![CDATA[
               .....
            ]]>
         </script>
      </pipeline>
   </pipelines>
</component>

<component name="DeletePM" subType="pipeline" factoryName="aspire-application">
   <debug>${debug}</debug>
   <gatherStatistics>${debug}</gatherStatistics>
   <pipelines>
      <pipeline name="deletePipeline" default="true">
         <script>
            <![CDATA[
               .....
            ]]>
         </script>
      </pipeline>
   </pipelines>
</component>
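
The extractor configuration also routes "onErrorSubJob" events to a pipeline manager named "ErrorPM", which is not shown above. A minimal sketch follows, assuming it has the same shape as the other two pipeline managers. The pipeline name "errorPipeline" is hypothetical, and the script body is only a placeholder comment (written in Groovy-style syntax, which is itself an assumption about the script language).

```xml
<component name="ErrorPM" subType="pipeline" factoryName="aspire-application">
   <debug>${debug}</debug>
   <gatherStatistics>${debug}</gatherStatistics>
   <pipelines>
      <!-- "errorPipeline" is a hypothetical name, chosen to mirror the pipelines above -->
      <pipeline name="errorPipeline" default="true">
         <script>
            <![CDATA[
               // Placeholder: handle failed sub-jobs here
            ]]>
         </script>
      </pipeline>
   </pipelines>
</component>
```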

Install as a Workflow Application



To install the component as a workflow application:

  1. Install and configure a Content Source for the crawl.
  2. Go to the AddUpdate pipeline.
  3. Go to the Workflow section and add a "Custom Application Function".
  4. Add the app-pst-extractor component.

Add new PST Extractor


5. Complete the configuration for:

    • General
    • Batching

General Properties


6. Complete the Extract Text (*) and Routing information.

Extract Text and Routing Properties


7. After you save the component, share it in a library (this is required).

8. Add it to the Delete pipeline from the shared library (this is required).

9. Save the connector.

(*) Note: In order to extract the content of the files inside the PST file, disable the extract text of the connector and configure it in the PST Extractor component. Add a rule for the extract text of the other jobs from the crawl. You can share the extract text in the same library used before.

Share Library