The File System Staging Repository Connector component performs full and incremental scans over a file system staging repository. It stores the ID of the last transaction processed so that only new transactions are processed on subsequent executions. Each transaction contains the URL of the item in the repository and the action of that transaction. The content from the repository is loaded into AspireObjects attached to Jobs and then submitted to the configured pipeline. Updated content is split into three types (add, update and delete), and each type is published as a different event so that it may be handled by a different Aspire pipeline.

The scanner reacts to an incoming job, which may instruct it to start, stop, pause or resume. Typically the start job contains all the information required to perform the crawl; however, the scanner can also be configured with default values via the application.xml file. When pausing or stopping, the scanner waits until all the jobs it has published have completed before completing itself.


  


File System Staging Repository Connector Component
Factory Name: com.searchtechnologies.aspire:aspire-file-repo-connector
subType: default
Inputs: AspireObject from a content source submitter holding all the information required for a crawl
Outputs: Jobs from the crawl

Configuration

This section lists all configuration parameters available to configure the File System Scanner component.

General Scanner Component Configuration

Basic Scanner Configuration

Element | Type | Default | Description
snapshotDir | String | snapshots | The directory for snapshot files.
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing.
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete.
maxOutstandingTimeStatistics | long | 1m | The maximum amount of time to wait before updating the statistics file. Whichever happens first between this property and maxOutstandingUpdatesStatistics will trigger an update to the statistics file.
maxOutstandingUpdatesStatistics | long | 1000 | The maximum number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file.
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domains in the group expander).

Branch Handler Configuration

This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.

Element | Type | Description
branches/branch/@event | string | The event to configure: onAdd, onDelete or onUpdate.
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager.
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details).
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing).
branches/branch/@batchSize | int | The maximum size of the batches that the branch handler will create.
branches/branch/@batchTimeout | long | Time to wait before a batch is closed if batchSize hasn't been reached.
branches/branch/@simultaneousBatches | int | The maximum number of simultaneous batches that will be handled by the branch handler.

Staging Repository Scanner Component Configuration

Staging repository scanners are able to process updates sent over a JMS messaging queue as well as those read from the repository. This allows the scanner to receive updates from the publisher and perform near-real-time updates. In order to receive these updates, the scanner must be configured with the server URL and the topic or queue name.

The scanner has a built-in ActiveMQ client, but it can be configured to connect to any JMS server using JNDI.

The JMS parameters are shown below.

JMS Updates configuration

Element | Type | Default | Description
updates/@enabled | boolean | false | If true, enables reception of JMS messages in the scanner.
updates/@broker | string | | The JMS message broker, e.g. tcp://localhost:61616.
updates/@channel | string | | The name of the JMS queue or topic to listen on.
updates/@topic | boolean | false | If true, the channel named in the @channel attribute is a topic. If false, it is a queue.
updates/@durable | boolean | false | If true, the channel is durable.
updates/@transacted | boolean | false | Use JMS transactions.
updates/jndi/@enabled | boolean | false | Use JNDI to connect to servers other than ActiveMQ.
updates/jndi/@factory | string | | The connection factory to use when using JNDI.
updates/jndi/classpath | String | | The class path for the JNDI libraries.
updates/properties/property/@name | String | | The name of a property to pass to the JNDI connection.
updates/properties/property | String | | The property value to pass to the JNDI connection.
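The Configuration Example later in this document shows a plain ActiveMQ connection; when connecting to a different JMS server via JNDI, the updates element also carries the jndi and properties children described above. A sketch of such a configuration follows (the broker URL, factory name, classpath and property values are illustrative placeholders, not tested values):

  <updates broker="tcp://jms-server:61616" channel="demoTopic" durable="true" enabled="true" topic="true" transacted="false">
    <jndi enabled="true" factory="jms/ConnectionFactory">
      <classpath>/opt/jndi-client/lib</classpath>
    </jndi>
    <properties>
      <property name="java.naming.factory.initial">org.example.jndi.InitialContextFactory</property>
    </properties>
  </updates>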

File System Staging Repository Connector Specific Configuration

None

Configuration Example

  <component factoryName="aspire-file-repo-connector" name="Scanner" subType="default">
    <debug>false</debug>
    <fullRecovery>incremental</fullRecovery>
    <incrementalRecovery>incremental</incrementalRecovery>
    <snapshotDir>${data.dir}/StagingToEngine/snapshots</snapshotDir>
    <waitForSubJobsTimeout>10m</waitForSubJobsTimeout>
    <emitCrawlStartJobs>false</emitCrawlStartJobs>
    <emitCrawlEndJobs>false</emitCrawlEndJobs>
    <enableAuditing>true</enableAuditing>
    <updates broker="tcp://localhost:61616" channel="demoQueue" enabled="true" topic="true"/>
    <branches>
      <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true" event="onAdd" pipeline="addUpdatePipeline" pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
      <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true" event="onUpdate" pipeline="addUpdatePipeline" pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
      <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true" event="onDelete" pipeline="deletePipeline" pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
      <branch allowRemote="true" event="onClear" pipeline="crawlStartEndPipeline" pipelineManager="../ProcessPipelineManager"/>
      <branch allowRemote="true" event="onCrawlStart" pipeline="crawlStartEndPipeline" pipelineManager="../ProcessPipelineManager"/>
      <branch allowRemote="true" event="onCrawlEnd" pipeline="crawlStartEndPipeline" pipelineManager="../ProcessPipelineManager"/>
    </branches>
  </component>


Source Configuration

Scanner Control Configuration

The following table describes the attributes that the AspireObject of the incoming scanner job requires in order to correctly execute and control the flow of a scan process.

Element | Type | Options | Description
@action | string | start, stop, pause, resume, abort | Control command telling the scanner which operation to perform. Use the start option to launch a new crawl.
@actionProperties | string | full, incremental | When a start @action is received, tells the scanner whether to run a full or an incremental crawl.
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled.
displayName | string | | Display or friendly name for the content source that will be crawled.

Header Example

  <doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector"
   scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
    ...
    <displayName>testSource</displayName>
    ...
  </doc>
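Control jobs other than start need only the control attributes. For example, jobs to pause and then stop the crawl launched above might look like the following sketch (reusing the content source name from the header example; other attributes are omitted for brevity):

  <doc action="pause" normalizedCSName="FeedOne_Connector"/>
  <doc action="stop" normalizedCSName="FeedOne_Connector"/>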

Common Staging Repository Configuration

All staging repository connectors support the configuration properties described in this section, relative to /doc/connectorSource of the AspireObject of the incoming job.

Element | Type | Default | Description
contentSource | string | __DEFAULT__ | The content source within the repository that this connector is processing.
updates/owner | string | | When a comma-separated list of owners is specified, as transactions are played back, only transactions related to one of the given owners will be processed. Leave empty to process transactions from all owners.
data/owner | string | [item] | When a comma-separated list of owners is specified, as transactions are played back, data in the store from each of the specified owners will be attached to the job. You may use the pseudo-owner [item] to mean the owner related to the current transaction, or [all] to mean all owners. Leave empty to attach only the data relating to the owner of the transaction being replayed ([item]).
forwardClearJobs | boolean | true | By default, any clear jobs of the staging repository will be re-published by this connector, resulting in clear jobs passing along the pipelines to workflow and publisher components. Set this to false if you wish to suppress this.
url | string | | The URL of the staging repository. The format changes depending on the staging repository type.
domain | string | | The domain of the username to use for connections to the staging repository.
user | string | | The username to use for connections to the staging repository.
password | string | | The user's password for connections to the staging repository.
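Putting these properties together, the following sketch shows a connectorSource fragment that replays only transactions belonging to one owner while attaching the stored data from all owners (the URL, content source and owner names are illustrative):

  <connectorSource>
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <updates>
      <owner>default</owner>
    </updates>
    <data>
      <owner>[all]</owner>
    </data>
    <forwardClearJobs>true</forwardClearJobs>
  </connectorSource>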

File System Staging Repository Configuration

In addition to the common configuration, the File System Staging Repository Connector supports the properties described in this section relative to /doc/connectorSource of the AspireObject of the incoming Job.

Element | Type | Default | Description
compress | boolean | false | Set this to true to compress all data and metadata written to the store.
fileLock | boolean | false | When false, the File System staging repository uses in-memory locking to maintain consistency in the store. If you wish to use the staging repository across JVMs or hosts, set this to true to use file locking.
algorithm | string | AES | When a password is set in the common configuration, encrypt all data and metadata written to the store using the given password, algorithm and transformation.
transformation | string | AES | When a password is set in the common configuration, encrypt all data and metadata written to the store using the given password, algorithm and transformation.
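For example, a connectorSource fragment that compresses the store, uses file locking so the store can be shared across JVMs, and encrypts data with a password from the common configuration might look like this sketch (the URL and password values are illustrative; the algorithm and transformation shown are simply the documented defaults):

  <connectorSource>
    <url>/repo/SRDemo</url>
    <compress>true</compress>
    <fileLock>true</fileLock>
    <password>mySecret</password>
    <algorithm>AES</algorithm>
    <transformation>AES</transformation>
  </connectorSource>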

Scanner Configuration Example

  <doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="1" normalizedCSName="StagingToEngine" scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="StagingToEngine">
    <connectorSource>
      <url>/repo/SRDemo</url>
      <contentSource>FileToStaging</contentSource>
      <cfgUpdates>false</cfgUpdates>
      <cfgData>false</cfgData>
      <encrypt>false</encrypt>
      <password/>
      <algorithm/>
      <transformation/>
      <forwardClearJobs>true</forwardClearJobs>
      <fileLock>true</fileLock>
    </connectorSource>
    <displayName>StagingToEngine</displayName>
  </doc>


Output

Output from the File System Staging Repository Connector depends heavily on its configuration and on the data stored in the repository. The data can consist of the stored item for a single data owner, or a merge of data from multiple data owners.

Single owner

<doc>
  <docType>item</docType>
  <url>c:\testdata\11\00\0\1.txt</url>
  <id>c:\testdata\11\00\0\1.txt</id>
  <fetchUrl>file:/c:/testdata/11/00/0/1.txt</fetchUrl>
  <displayUrl>c:\testdata\11\00\0\1.txt</displayUrl>
  <snapshotUrl>004 c:\testdata\11\00\0\1.txt</snapshotUrl>
  <repItemType>aspire/file</repItemType>
  <lastModified>2014-01-09T15:39:11Z</lastModified>
  <dataSize>4264</dataSize>
  <sourceName>FileToStaging</sourceName>
  <sourceType>filesystem</sourceType>
  <connectorSource type="filesystem">
    <url>c:\testdata\11</url>
    <partialScan>false</partialScan>
    <subDirUrl/>
    <indexContainers>false</indexContainers>
    <scanRecursively>true</scanRecursively>
    <useACLs>false</useACLs>
    <acls/>
    <scanExcludedItems>false</scanExcludedItems>
    <fileNamePatterns/>
    <displayName>FileToStaging</displayName>
  </connectorSource>
  <hierarchy>
    <item id="65616DDA92AAE39EF89209A1DC824E5B" level="4" name="1.txt" url="c:\testdata\11\00\0\1.txt">
      <ancestors>
        <ancestor id="A9633F7A8B463C5FB91EFC29D20B1C8C" level="3" name="0\" parent="true" type="aspire/folder" url="c:\testdata\11\00\0\"/>
        <ancestor id="E4FF3AB9206EFBD10F9BBB6378144D30" level="2" name="00\" type="aspire/folder" url="c:\testdata\11\00\"/>
        <ancestor id="C7D8344572B512D71B684BB6FD8EC267" level="1" name="FileToStaging" type="aspire/filesystem" url="c:\testdata\11\"/>
      </ancestors>
    </item>
  </hierarchy>
  <protocol source="FetchURLStage/protocol">file</protocol>
  <mimeType source="FetchURLStage/mimeType">text/plain</mimeType>
  <extension source="FetchURLStage">
    <field name="modificationDate">2014-01-09T15:39:11Z</field>
    <field name="content-type">text/plain</field>
    <field name="content-length">4264</field>
    <field name="last-modified">Thu, 09 Jan 2014 15:39:11 GMT</field>
  </extension>
  <connectorSource type="FileSystemStagingRepository">
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <cfgUpdates>false</cfgUpdates>
    <cfgData>false</cfgData>
    <encrypt>false</encrypt>
    <password/>
    <algorithm/>
    <transformation/>
    <forwardClearJobs>true</forwardClearJobs>
    <fileLock>true</fileLock>
    <displayName>StagingToEngine</displayName>
  </connectorSource>
  <action>add</action>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=windows-1252</contentType>
  <extension source="ExtractTextStage">
    <field name="Content-Encoding">windows-1252</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="resourceName">c:\testdata\11\00\0\1.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[A brutal blast of arctic air has settled .....]]></content>
</doc>

Multiple owner

When outputting data from multiple owners, each tag has an owner attribute added to indicate the source of the data.

<doc>
  <docType owner="default">item</docType>
  <url owner="default">c:\testdata\11\00\0\1.txt</url>
  <id owner="default">c:\testdata\11\00\0\1.txt</id>
  <fetchUrl owner="default">file:/c:/testdata/11/00/0/1.txt</fetchUrl>
  <displayUrl owner="default">c:\testdata\11\00\0\1.txt</displayUrl>
  <snapshotUrl owner="default">004 c:\testdata\11\00\0\1.txt</snapshotUrl>
  <repItemType owner="default">aspire/file</repItemType>
  <lastModified owner="default">2014-01-09T15:39:11Z</lastModified>
  <dataSize owner="default">4264</dataSize>
  <sourceName owner="default">FileToStaging</sourceName>
  <sourceType owner="default">filesystem</sourceType>
  <connectorSource owner="default" type="filesystem">
    <url>c:\testdata\11</url>
    <partialScan>false</partialScan>
    <subDirUrl/>
    <indexContainers>false</indexContainers>
    <scanRecursively>true</scanRecursively>
    <useACLs>false</useACLs>
    <acls/>
    <scanExcludedItems>false</scanExcludedItems>
    <fileNamePatterns/>
    <displayName>FileToStaging</displayName>
  </connectorSource>
  <hierarchy owner="default">
    <item id="65616DDA92AAE39EF89209A1DC824E5B" level="4" name="1.txt" url="c:\testdata\11\00\0\1.txt">
      <ancestors>
        <ancestor id="A9633F7A8B463C5FB91EFC29D20B1C8C" level="3" name="0\" parent="true" type="aspire/folder" url="c:\testdata\11\00\0\"/>
        <ancestor id="E4FF3AB9206EFBD10F9BBB6378144D30" level="2" name="00\" type="aspire/folder" url="c:\testdata\11\00\"/>
        <ancestor id="C7D8344572B512D71B684BB6FD8EC267" level="1" name="FileToStaging" type="aspire/filesystem" url="c:\testdata\11\"/>
      </ancestors>
    </item>
  </hierarchy>
  <protocol owner="default" source="FetchURLStage/protocol">file</protocol>
  <mimeType owner="default" source="FetchURLStage/mimeType">text/plain</mimeType>
  <extension owner="default" source="FetchURLStage">
    <field name="modificationDate">2014-01-09T15:39:11Z</field>
    <field name="content-type">text/plain</field>
    <field name="content-length">4264</field>
    <field name="last-modified">Thu, 09 Jan 2014 15:39:11 GMT</field>
  </extension>
  <connectorSource type="FileSystemStagingRepository">
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <cfgUpdates>false</cfgUpdates>
    <cfgData>true</cfgData>
    <data>
      <owner>default</owner>
      <owner>bg</owner>
    </data>
    <encrypt>false</encrypt>
    <password/>
    <algorithm/>
    <transformation/>
    <forwardClearJobs>true</forwardClearJobs>
    <fileLock>true</fileLock>
    <displayName>StagingToEngine</displayName>
  </connectorSource>
  <action>add</action>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=windows-1252</contentType>
  <extension source="ExtractTextStage">
    <field name="Content-Encoding">windows-1252</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="resourceName">c:\testdata\11\00\0\1.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[A brutal blast of arctic air has settled ......]]></content>
</doc>

