Welcome to the PST (Outlook archive) Extractor for Aspire. This is a central location for all information on crawling and processing content using the Aspire PST Extractor and associated components.


Introduction


The PST Extractor will crawl content from any PST (Microsoft Outlook archive) file or PST stream. The extractor will request and retrieve Outlook folders, emails, appointments, and attachments.

Note that the PST Extractor is not part of the Aspire Community bundle, but may be purchased separately.

Some of the features of the PST Extractor include:

  • Ability to perform either full or incremental extraction (so that only new/updated documents are indexed)
  • Fetches metadata for retrieved content
  • Is search engine independent. The content retrieved can be published by Aspire to any search engine


Crawling PST Extractor Tutorial


This tutorial walks through the steps necessary to use the PST Extractor with Aspire. This stage is primarily intended to split an Outlook PST file containing a list of items (folders, emails, attachments, appointments, etc.) and then to process each individual record one at a time, as sub-jobs on their own pipeline.

This stage takes a job that contains a data stream. It assumes that the data stream represents a PST file and then processes every entry in the file to create sub-job documents.


Known Limitations

  • The extractor ignores any security on the PST file (password protection). It processes a secured PST file as a normal file.
  • The extractor creates temporary files in the system's default temporary file location (for PST files that do not reside on the disk).
    • e.g. the path "C:\Users\username\AppData\Local\Temp" on Windows
  • If PST files are excluded from the crawl, the Scan Excluded Items option will not work for these items.
  • At the moment, the "Delete by Query" functionality of the component works only with the Elasticsearch and Solr Publishers.
  • The "Delete by Query" implementation for the remaining publishers is still pending.
  • The PST Extractor rule must be shared in a shared library in order to use the same rule in both the onAddUpdate and onDelete stages.

Configuration

Element | Type | Default | Description
indexContainers | Boolean | true | Indicates whether folders are indexed.
scanRecursively | Boolean | true | Indicates whether subfolders are scanned.
parentInfo | Boolean | true | Indicates whether the parent (archive file) information is added to every entry of the file.
deleteFirst | Boolean | false | Indicates whether the delete by query is sent before or after the PST file is processed.
indexArchive | Boolean | false | Indicates whether the PST (archive) job is terminated so that no further stages are run for this job. (All the entries of the archive file will still be processed.)
discoveryMethod | String | AI | Indicates the method used to recognize a PST file (AI - Auto Identify, or RE - Regex).
archivesRegex | String | | Indicates the regular expression used by the discovery method.

Example configuration

	<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-pst-extractor">
		<indexContainers>${indexContainers}</indexContainers>
		<scanRecursively>${scanRecursively}</scanRecursively>
		<parentInfo>${parentInfo}</parentInfo>
		<deleteFirst>${deleteFirst}</deleteFirst>
		<indexArchive>${indexArchive}</indexArchive>
		<discoveryMethod>${discoveryMethod}</discoveryMethod>
		<checklist>${checklist}</checklist>
		<archivesRegex>${archivesRegex}</archivesRegex>
		<debug>${debug}</debug>
		<branches>
			<branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="false" />

			<branch event="onErrorSubJob" pipelineManager="ErrorPM" batching="false" />

			<branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="false" />
		</branches>
	</component>
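
As a rough sketch, the same component with literal values in place of the property placeholders might look like the following. The values are illustrative assumptions, not recommended settings: the Boolean options use the defaults listed in the configuration table, discovery is switched to RE, and the archivesRegex pattern is a hypothetical one for names ending in ".pst". How the pattern is matched (full path or file name, case sensitivity) is not stated on this page, so verify it against your own content.

```xml
<component name="ArchiveSubJobExtractor" subType="default" factoryName="aspire-pst-extractor">
	<!-- Defaults taken from the configuration table -->
	<indexContainers>true</indexContainers>
	<scanRecursively>true</scanRecursively>
	<parentInfo>true</parentInfo>
	<deleteFirst>false</deleteFirst>
	<indexArchive>false</indexArchive>
	<!-- RE = regex-based discovery instead of the default AI (Auto Identify) -->
	<discoveryMethod>RE</discoveryMethod>
	<!-- Hypothetical pattern: anything ending in ".pst"; adjust to your content -->
	<archivesRegex>.*\.pst$</archivesRegex>
	<!-- checklist and debug are not described in the table, so they stay as placeholders -->
	<checklist>${checklist}</checklist>
	<debug>${debug}</debug>
	<branches>
		<branch event="onAddSubJob" pipelineManager="AddUpdatePM" batching="false" />
		<branch event="onErrorSubJob" pipelineManager="ErrorPM" batching="false" />
		<branch event="onDeleteSubJob" pipelineManager="DeletePM" batching="false" />
	</branches>
</component>
```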

Note that you need to configure two pipelines because of the branches for the "AddUpdate" and "Delete" sub-jobs. Additionally, you can add an Extract Text component in order to get the content of every entry.

<component name="AddUpdatePM" subType="pipeline" factoryName="aspire-application">
   <debug>${debug}</debug>
   <gatherStatistics>${debug}</gatherStatistics>
   <pipelines>
      <pipeline name="addUpdatePipeline" default="true">
         <script>
            <![CDATA[
               .....
            ]]>
         </script>
      </pipeline>
   </pipelines>
</component>

<component name="DeletePM" subType="pipeline" factoryName="aspire-application">
   <debug>${debug}</debug>
   <gatherStatistics>${debug}</gatherStatistics>
   <pipelines>
      <pipeline name="deletePipeline" default="true">
         <script>
            <![CDATA[
               .....
            ]]>
         </script>
      </pipeline>
   </pipelines>
</component>	    
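
The example configuration also routes onErrorSubJob to a pipeline manager named ErrorPM, which is not shown here. Assuming it follows the same shape as the two pipeline managers above, a minimal sketch could look like this. The pipeline name "errorPipeline" is hypothetical, and the script body is left elided in the same way as in the other examples.

```xml
<component name="ErrorPM" subType="pipeline" factoryName="aspire-application">
   <debug>${debug}</debug>
   <gatherStatistics>${debug}</gatherStatistics>
   <pipelines>
      <!-- "errorPipeline" is a hypothetical name chosen for this sketch -->
      <pipeline name="errorPipeline" default="true">
         <script>
            <![CDATA[
               .....
            ]]>
         </script>
      </pipeline>
   </pipelines>
</component>
```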

Install as a Workflow Application

To install the component as a workflow application:

  1. Install and configure a Content Source for the crawl.
  2. Go to the AddUpdate pipeline.
  3. Go to the workflow section and add a "Custom Application Function".
  4. Add the app-pst-extractor component.
  5. Configure it:
    1. General
    2. Batching
    3. Extract Text (*)
    4. Routing
  6. Once you save the component, share it in a library (this is required).
  7. Add it to the Delete pipeline (from the shared library; this is required).
  8. Save the connector.

(*) Note: To extract the content of the files inside the PST file, you need to disable text extraction in the connector and configure it in the PST Extractor component instead. You then need to add an Extract Text rule for the other jobs from the crawl (you can share the Extract Text rule in the same library used before).


