- Created by Johnny Vargas on Jun 28, 2018
The File System Staging Repository Connector component performs full and incremental scans over a file system staging repository. It stores the id of the last transaction processed so that only new transactions are processed on subsequent executions. Each transaction contains the URL of the item in the repository and the action of the transaction. The content from the repository is loaded into AspireObjects attached to Jobs and then submitted to the configured pipeline. Updated content is split into three types (add, update and delete), and each type is published as a different event so that it may be handled by a different Aspire pipeline.
The scanner reacts to an incoming job. This job may instruct the scanner to start, stop, pause or resume. Typically the start job will contain all the information required to perform the crawl; however, the scanner can also be configured with default values via the application.xml file. When pausing or stopping, the scanner will wait until all the jobs it published have completed before completing itself.
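As a minimal sketch, a control job is an AspireObject whose @action attribute carries the command. The attribute names follow the Scanner Control Configuration section later in this page; the specific values shown here are illustrative:

```xml
<!-- Start an incremental crawl (normalizedCSName value is illustrative) -->
<doc action="start" actionProperties="incremental" normalizedCSName="StagingToEngine"/>

<!-- Pause, resume or stop the running crawl -->
<doc action="pause"/>
<doc action="resume"/>
<doc action="stop"/>
```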
File System Staging Repository Connector Component | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-file-repo-connector |
subType | default |
Inputs | AspireObject from a content source submitter holding all the information required for a crawl |
Outputs | Jobs from the crawl |
Configuration
This section lists all configuration parameters available to configure the File System Scanner component.
General Scanner Component Configuration
Basic Scanner Configuration
Element | Type | Default | Description |
---|---|---|---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m | The max amount of time to wait before updating the statistics file. Whichever happens first between this property and maxOutstandingUpdatesStatistics will trigger an update to the statistics file. |
maxOutstandingUpdatesStatistics | long | 1000 | The max number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domain in the group expander). |
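As an illustrative sketch, the basic scanner settings above can be set as child elements of the component definition (the placement mirrors the Configuration Example later in this page; values shown are the defaults or placeholders):

```xml
<component factoryName="aspire-file-repo-connector" name="Scanner" subType="default">
  <snapshotDir>${data.dir}/snapshots</snapshotDir>
  <numOfSnapshotBackups>2</numOfSnapshotBackups>
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
  <usesDomain>true</usesDomain>
</component>
```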
Branch Handler Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The max size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The max number of simultaneous batches that will be handled by the branch handler. |
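A single branch entry using these attributes might look like the following sketch (the pipeline and pipeline manager names are placeholders taken from the Configuration Example later in this page):

```xml
<branches>
  <branch event="onAdd" pipelineManager="../ProcessPipelineManager"
          pipeline="addUpdatePipeline" allowRemote="true"
          batching="true" batchSize="50" batchTimeout="60000"
          simultaneousBatches="2"/>
</branches>
```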
Staging Repository Scanner Component Configuration
Staging repository scanners are able to process updates sent over a JMS messaging queue as well as those read from the repository. This allows the scanner to receive updates from the publisher and perform near real time updates. In order to receive these updates, the scanner must be configured with the server URL and topic or queue name:
The scanner has a built-in ActiveMQ client, but can be configured to connect to any JMS server using JNDI.
The JMS parameters are shown below.
JMS Updates configuration
Element | Type | Default | Description |
---|---|---|---|
updates/@enabled | boolean | false | If true, enable reception of JMS messages in the scanner |
updates/@broker | string | | The JMS message broker, e.g. tcp://localhost:61616 |
updates/@channel | string | | The name of the JMS queue or topic to listen on |
updates/@topic | boolean | false | If true, the channel named in the @channel attribute is a topic. If false, it is a queue. |
updates/@durable | boolean | false | If true, the channel is durable. |
updates/@transacted | boolean | false | Use JMS transactions |
updates/jndi/@enabled | boolean | false | Use JNDI to connect to JMS servers other than ActiveMQ |
updates/jndi/@factory | string | | The connection factory to use when using JNDI |
updates/jndi/classpath | String | | The class path for the JNDI libraries |
updates/properties/property/@name | String | | The name of a property to pass to the JNDI connection |
updates/properties/property | String | | The value of the property to pass to the JNDI connection |
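Putting these attributes together, a JMS updates configuration with an optional JNDI section might look like the following sketch; the broker URL, channel name, factory class, classpath and property name are placeholders, not tested values:

```xml
<updates enabled="true" broker="tcp://localhost:61616"
         channel="demoQueue" topic="true" durable="false" transacted="false">
  <!-- Only needed when connecting to a non-ActiveMQ JMS server via JNDI -->
  <jndi enabled="false" factory="org.example.MyConnectionFactory">
    <classpath>/path/to/jndi/libs</classpath>
  </jndi>
  <properties>
    <property name="java.naming.provider.url">tcp://localhost:61616</property>
  </properties>
</updates>
```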
File System Staging Repository Connector Specific Configuration
None
Configuration Example
```xml
<component factoryName="aspire-file-repo-connector" name="Scanner" subType="default">
  <debug>false</debug>
  <fullRecovery>incremental</fullRecovery>
  <incrementalRecovery>incremental</incrementalRecovery>
  <snapshotDir>${data.dir}/StagingToEngine/snapshots</snapshotDir>
  <waitForSubJobsTimeout>10m</waitForSubJobsTimeout>
  <emitCrawlStartJobs>false</emitCrawlStartJobs>
  <emitCrawlEndJobs>false</emitCrawlEndJobs>
  <enableAuditing>true</enableAuditing>
  <updates broker="tcp://localhost:61616" channel="demoQueue" enabled="true" topic="true"/>
  <branches>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true"
            event="onAdd" pipeline="addUpdatePipeline"
            pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true"
            event="onUpdate" pipeline="addUpdatePipeline"
            pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true"
            event="onDelete" pipeline="deletePipeline"
            pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" event="onClear" pipeline="crawlStartEndPipeline"
            pipelineManager="../ProcessPipelineManager"/>
    <branch allowRemote="true" event="onCrawlStart" pipeline="crawlStartEndPipeline"
            pipelineManager="../ProcessPipelineManager"/>
    <branch allowRemote="true" event="onCrawlEnd" pipeline="crawlStartEndPipeline"
            pipelineManager="../ProcessPipelineManager"/>
  </branches>
</component>
```
Source Configuration
Scanner Control Configuration
The following table describes the list of attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command to tell the scanner which operation to perform. Use start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, it will tell the scanner to either run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
Header Example
```xml
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0"
     jobNumber="0" normalizedCSName="FeedOne_Connector" scheduleId="0"
     scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
```
Common Staging Repository Configuration
All staging repository connectors support the configuration properties described in this section, specified relative to the /doc/connectorSource element of the AspireObject of the incoming job.
Element | Type | Default | Description |
---|---|---|---|
contentSource | string | __DEFAULT__ | The content source within the repository that this connector is processing |
updates/owner | string | | When a comma-separated list of owners is specified, as transactions are played back, only transactions related to one of the given owners will be processed. Leave empty to process transactions from all owners |
data/owner | string | [item] | When a comma-separated list of owners is specified, as transactions are played back, data in the store from each of the specified owners will be attached to the job. You may use the pseudo owner [item] to mean the owner related to the current transaction, or [all] to mean all owners. Leave empty to attach only the data relating to the owner of the transaction being replayed ([item]) |
forwardClearJobs | boolean | true | By default, any clear jobs from the staging repository will be re-published by this connector, resulting in clear jobs passing along the pipelines to workflow and publisher components. Set this to false to suppress this behavior |
url | string | | The url of the staging repository. The format will change depending on the staging repository type |
domain | string | | The domain of the username to use for connections to the staging repository |
user | string | | The username to use for connections to the staging repository |
password | string | | The user's password for connections to the staging repository |
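A sketch of these common properties inside /doc/connectorSource; the url, content source and owner values are placeholders, and the empty domain, user and password elements simply show where credentials would go:

```xml
<connectorSource>
  <url>/repo/SRDemo</url>
  <contentSource>FileToStaging</contentSource>
  <updates>
    <owner>default</owner>
  </updates>
  <data>
    <owner>[item]</owner>
  </data>
  <forwardClearJobs>true</forwardClearJobs>
  <domain/>
  <user/>
  <password/>
</connectorSource>
```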
File System Staging Repository Configuration
In addition to the common configuration, the File System Staging Repository Connector supports the properties described in this section relative to /doc/connectorSource of the AspireObject of the incoming Job.
Element | Type | Default | Description |
---|---|---|---|
compress | boolean | false | Set this to true to compress all data and metadata written to the store |
fileLock | boolean | false | When false, the File System staging repository will use in-memory locking to maintain consistency in the store. If you wish to use the staging repository across JVMs or hosts, set this to true to use file locking |
algorithm | string | AES | When a password is set in the common configuration, encrypt all data and metadata written to the store using the given password, algorithm and transformation |
transformation | string | AES | When a password is set in the common configuration, encrypt all data and metadata written to the store using the given password, algorithm and transformation |
Scanner Configuration Example
```xml
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0"
     jobNumber="1" normalizedCSName="StagingToEngine" scheduleId="0"
     scheduler="##AspireSystemScheduler##" sourceName="StagingToEngine">
  <connectorSource>
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <cfgUpdates>false</cfgUpdates>
    <cfgData>false</cfgData>
    <encrypt>false</encrypt>
    <password/>
    <algorithm/>
    <transformation/>
    <forwardClearJobs>true</forwardClearJobs>
    <fileLock>true</fileLock>
  </connectorSource>
  <displayName>StagingToEngine</displayName>
</doc>
```
Output
Output from the File System Staging Repository Connector is highly dependent on its configuration and the data stored in the repository. The data can consist of the stored item for a single data owner, or a merge of data from multiple data owners.
Single owner
```xml
<doc>
  <docType>item</docType>
  <url>c:\testdata\11\00\0\1.txt</url>
  <id>c:\testdata\11\00\0\1.txt</id>
  <fetchUrl>file:/c:/testdata/11/00/0/1.txt</fetchUrl>
  <displayUrl>c:\testdata\11\00\0\1.txt</displayUrl>
  <snapshotUrl>004 c:\testdata\11\00\0\1.txt</snapshotUrl>
  <repItemType>aspire/file</repItemType>
  <lastModified>2014-01-09T15:39:11Z</lastModified>
  <dataSize>4264</dataSize>
  <sourceName>FileToStaging</sourceName>
  <sourceType>filesystem</sourceType>
  <connectorSource type="filesystem">
    <url>c:\testdata\11</url>
    <partialScan>false</partialScan>
    <subDirUrl/>
    <indexContainers>false</indexContainers>
    <scanRecursively>true</scanRecursively>
    <useACLs>false</useACLs>
    <acls/>
    <scanExcludedItems>false</scanExcludedItems>
    <fileNamePatterns/>
    <displayName>FileToStaging</displayName>
  </connectorSource>
  <hierarchy>
    <item id="65616DDA92AAE39EF89209A1DC824E5B" level="4" name="1.txt" url="c:\testdata\11\00\0\1.txt">
      <ancestors>
        <ancestor id="A9633F7A8B463C5FB91EFC29D20B1C8C" level="3" name="0\" parent="true" type="aspire/folder" url="c:\testdata\11\00\0\"/>
        <ancestor id="E4FF3AB9206EFBD10F9BBB6378144D30" level="2" name="00\" type="aspire/folder" url="c:\testdata\11\00\"/>
        <ancestor id="C7D8344572B512D71B684BB6FD8EC267" level="1" name="FileToStaging" type="aspire/filesystem" url="c:\testdata\11\"/>
      </ancestors>
    </item>
  </hierarchy>
  <protocol source="FetchURLStage/protocol">file</protocol>
  <mimeType source="FetchURLStage/mimeType">text/plain</mimeType>
  <extension source="FetchURLStage">
    <field name="modificationDate">2014-01-09T15:39:11Z</field>
    <field name="content-type">text/plain</field>
    <field name="content-length">4264</field>
    <field name="last-modified">Thu, 09 Jan 2014 15:39:11 GMT</field>
  </extension>
  <connectorSource type="FileSystemStagingRepository">
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <cfgUpdates>false</cfgUpdates>
    <cfgData>false</cfgData>
    <encrypt>false</encrypt>
    <password/>
    <algorithm/>
    <transformation/>
    <forwardClearJobs>true</forwardClearJobs>
    <fileLock>true</fileLock>
    <displayName>StagingToEngine</displayName>
  </connectorSource>
  <action>add</action>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=windows-1252</contentType>
  <extension source="ExtractTextStage">
    <field name="Content-Encoding">windows-1252</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="resourceName">c:\testdata\11\00\0\1.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[A brutal blast of arctic air has settled .....]]></content>
</doc>
```
Multiple owner
When outputting data from multiple owners, each tag has an owner attribute added to indicate the source of the data:
```xml
<doc>
  <docType owner="default">item</docType>
  <url owner="default">c:\testdata\11\00\0\1.txt</url>
  <id owner="default">c:\testdata\11\00\0\1.txt</id>
  <fetchUrl owner="default">file:/c:/testdata/11/00/0/1.txt</fetchUrl>
  <displayUrl owner="default">c:\testdata\11\00\0\1.txt</displayUrl>
  <snapshotUrl owner="default">004 c:\testdata\11\00\0\1.txt</snapshotUrl>
  <repItemType owner="default">aspire/file</repItemType>
  <lastModified owner="default">2014-01-09T15:39:11Z</lastModified>
  <dataSize owner="default">4264</dataSize>
  <sourceName owner="default">FileToStaging</sourceName>
  <sourceType owner="default">filesystem</sourceType>
  <connectorSource owner="default" type="filesystem">
    <url>c:\testdata\11</url>
    <partialScan>false</partialScan>
    <subDirUrl/>
    <indexContainers>false</indexContainers>
    <scanRecursively>true</scanRecursively>
    <useACLs>false</useACLs>
    <acls/>
    <scanExcludedItems>false</scanExcludedItems>
    <fileNamePatterns/>
    <displayName>FileToStaging</displayName>
  </connectorSource>
  <hierarchy owner="default">
    <item id="65616DDA92AAE39EF89209A1DC824E5B" level="4" name="1.txt" url="c:\testdata\11\00\0\1.txt">
      <ancestors>
        <ancestor id="A9633F7A8B463C5FB91EFC29D20B1C8C" level="3" name="0\" parent="true" type="aspire/folder" url="c:\testdata\11\00\0\"/>
        <ancestor id="E4FF3AB9206EFBD10F9BBB6378144D30" level="2" name="00\" type="aspire/folder" url="c:\testdata\11\00\"/>
        <ancestor id="C7D8344572B512D71B684BB6FD8EC267" level="1" name="FileToStaging" type="aspire/filesystem" url="c:\testdata\11\"/>
      </ancestors>
    </item>
  </hierarchy>
  <protocol owner="default" source="FetchURLStage/protocol">file</protocol>
  <mimeType owner="default" source="FetchURLStage/mimeType">text/plain</mimeType>
  <extension owner="default" source="FetchURLStage">
    <field name="modificationDate">2014-01-09T15:39:11Z</field>
    <field name="content-type">text/plain</field>
    <field name="content-length">4264</field>
    <field name="last-modified">Thu, 09 Jan 2014 15:39:11 GMT</field>
  </extension>
  <connectorSource type="FileSystemStagingRepository">
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <cfgUpdates>false</cfgUpdates>
    <cfgData>true</cfgData>
    <data>
      <owner>default</owner>
      <owner>bg</owner>
    </data>
    <encrypt>false</encrypt>
    <password/>
    <algorithm/>
    <transformation/>
    <forwardClearJobs>true</forwardClearJobs>
    <fileLock>true</fileLock>
    <displayName>StagingToEngine</displayName>
  </connectorSource>
  <action>add</action>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=windows-1252</contentType>
  <extension source="ExtractTextStage">
    <field name="Content-Encoding">windows-1252</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="resourceName">c:\testdata\11\00\0\1.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[A brutal blast of arctic air has settled ......]]></content>
</doc>
```