- Created by Johnny Vargas on Jun 28, 2018
The File System Staging Repository Connector component performs full and incremental scans over a file system staging repository. It stores the id of the last transaction processed so that only new transactions are processed on subsequent executions. Each transaction contains the URL of the item in the repository and the action of the transaction. The content from the repository is loaded into AspireObjects attached to Jobs and then submitted to the configured pipeline. Updated content is split into three types (add, update and delete), and each type is published as a different event so that it may be handled by a different Aspire pipeline.
The scanner reacts to an incoming job. This job may instruct the scanner to start, stop, pause or resume. Typically the start job will contain all the information required to perform the crawl; however, the scanner can also be configured with default values via the application.xml file. When pausing or stopping, the scanner will wait until all the jobs it published have completed before completing itself.
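As a minimal sketch, a control job is an AspireObject whose @action attribute carries the command. The attribute names follow the Scanner Control Configuration section later in this page; the specific values shown here are illustrative:

```xml
<!-- Start an incremental crawl (normalizedCSName value is illustrative) -->
<doc action="start" actionProperties="incremental" normalizedCSName="StagingToEngine"/>

<!-- Pause, resume or stop the running crawl -->
<doc action="pause"/>
<doc action="resume"/>
<doc action="stop"/>
```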
File System Staging Repository Connector Component | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-file-repo-connector |
subType | default |
Inputs | AspireObject from a content source submitter holding all the information required for a crawl |
Outputs | Jobs from the crawl |
Configuration
This section lists all configuration parameters available to configure the File System Scanner component.
General Scanner Component Configuration
Basic Scanner Configuration
Element | Type | Default | Description |
---|---|---|---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m | The max amount of time to wait before updating the statistics file. Whichever happens first between this property and maxOutstandingUpdatesStatistics will trigger an update to the statistics file. |
maxOutstandingUpdatesStatistics | long | 1000 | The max number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domain in the group expander). |
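As an illustrative sketch, the basic scanner settings above can be set as child elements of the component definition (the placement mirrors the Configuration Example later in this page; values shown are the defaults or placeholders):

```xml
<component factoryName="aspire-file-repo-connector" name="Scanner" subType="default">
  <snapshotDir>${data.dir}/snapshots</snapshotDir>
  <numOfSnapshotBackups>2</numOfSnapshotBackups>
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
  <usesDomain>true</usesDomain>
</component>
```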
Branch Handler Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The max size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The max number of simultaneous batches that will be handled by the branch handler. |
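A single branch entry using these attributes might look like the following sketch (the pipeline and pipeline manager names are placeholders taken from the Configuration Example later in this page):

```xml
<branches>
  <branch event="onAdd" pipelineManager="../ProcessPipelineManager"
          pipeline="addUpdatePipeline" allowRemote="true"
          batching="true" batchSize="50" batchTimeout="60000"
          simultaneousBatches="2"/>
</branches>
```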
Staging Repository Scanner Component Configuration
Staging repository scanners are able to process updates sent over a JMS messaging queue as well as those read from the repository. This allows the scanner to receive updates from the publisher and perform near real time updates. In order to receive these updates, the scanner must be configured with the server URL and topic or queue name:
The scanner has a built-in ActiveMQ client, but can be configured to connect to any JMS server using JNDI.
The JMS parameters are shown below.
JMS Updates configuration
Element | Type | Default | Description |
---|---|---|---|
updates/@enabled | boolean | false | If true, enable reception of JMS messages in the scanner |
updates/@broker | string | | The JMS message broker, e.g. tcp://localhost:61616 |
updates/@channel | string | | The name of the JMS queue or topic to listen on |
updates/@topic | boolean | false | If true, the channel named in the @channel attribute is a topic. If false, it is a queue. |
updates/@durable | boolean | false | If true, the channel is durable. |
updates/@transacted | boolean | false | Use JMS transactions |
updates/jndi/@enabled | boolean | false | Use JNDI to connect to JMS servers other than ActiveMQ |
updates/jndi/@factory | string | | The connection factory to use when using JNDI |
updates/jndi/classpath | String | | The class path for the JNDI libraries |
updates/properties/property/@name | String | | The name of a property to pass to the JNDI connection |
updates/properties/property | String | | The value of the property to pass to the JNDI connection |
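Putting these attributes together, a JMS updates configuration with an optional JNDI section might look like the following sketch; the broker URL, channel name, factory class, classpath and property name are placeholders, not tested values:

```xml
<updates enabled="true" broker="tcp://localhost:61616"
         channel="demoQueue" topic="true" durable="false" transacted="false">
  <!-- Only needed when connecting to a non-ActiveMQ JMS server via JNDI -->
  <jndi enabled="false" factory="org.example.MyConnectionFactory">
    <classpath>/path/to/jndi/libs</classpath>
  </jndi>
  <properties>
    <property name="java.naming.provider.url">tcp://localhost:61616</property>
  </properties>
</updates>
```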
File System Staging Repository Connector Specific Configuration
None
Configuration Example
```xml
<component factoryName="aspire-file-repo-connector" name="Scanner" subType="default">
  <debug>false</debug>
  <fullRecovery>incremental</fullRecovery>
  <incrementalRecovery>incremental</incrementalRecovery>
  <snapshotDir>${data.dir}/StagingToEngine/snapshots</snapshotDir>
  <waitForSubJobsTimeout>10m</waitForSubJobsTimeout>
  <emitCrawlStartJobs>false</emitCrawlStartJobs>
  <emitCrawlEndJobs>false</emitCrawlEndJobs>
  <enableAuditing>true</enableAuditing>
  <updates broker="tcp://localhost:61616" channel="demoQueue" enabled="true" topic="true"/>
  <branches>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true"
            event="onAdd" pipeline="addUpdatePipeline"
            pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true"
            event="onUpdate" pipeline="addUpdatePipeline"
            pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true"
            event="onDelete" pipeline="deletePipeline"
            pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" event="onClear" pipeline="crawlStartEndPipeline"
            pipelineManager="../ProcessPipelineManager"/>
    <branch allowRemote="true" event="onCrawlStart" pipeline="crawlStartEndPipeline"
            pipelineManager="../ProcessPipelineManager"/>
    <branch allowRemote="true" event="onCrawlEnd" pipeline="crawlStartEndPipeline"
            pipelineManager="../ProcessPipelineManager"/>
  </branches>
</component>
```
Source Configuration
Scanner Control Configuration
The following table describes the list of attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command to tell the scanner which operation to perform. Use start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, it will tell the scanner to either run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
Header Example
```xml
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0"
     jobNumber="0" normalizedCSName="FeedOne_Connector" scheduleId="0"
     scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
```
Common Staging Repository Configuration
All staging repository connectors support the configuration properties described in this section, specified relative to the /doc/connectorSource element of the AspireObject of the incoming job.
Element | Type | Default | Description |
---|---|---|---|
contentSource | string | __DEFAULT__ | The content source within the repository that this connector is processing |
updates/owner | string | | When a comma-separated list of owners is specified, as transactions are played back, only transactions related to one of the given owners will be processed. Leave empty to process transactions from all owners |
data/owner | string | [item] | When a comma-separated list of owners is specified, as transactions are played back, data in the store from each of the specified owners will be attached to the job. You may use the pseudo owner [item] to mean the owner related to the current transaction, or [all] to mean all owners. Leave empty to attach only the data relating to the owner of the transaction being replayed ([item]) |
forwardClearJobs | boolean | true | By default, any clear jobs from the staging repository will be re-published by this connector, resulting in clear jobs passing along the pipelines to workflow and publisher components. Set this to false to suppress this behavior |
url | string | | The url of the staging repository. The format will change depending on the staging repository type |
domain | string | | The domain of the username to use for connections to the staging repository |
user | string | | The username to use for connections to the staging repository |
password | string | | The user's password for connections to the staging repository |
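A sketch of these common properties inside /doc/connectorSource; the url, content source and owner values are placeholders, and the empty domain, user and password elements simply show where credentials would go:

```xml
<connectorSource>
  <url>/repo/SRDemo</url>
  <contentSource>FileToStaging</contentSource>
  <updates>
    <owner>default</owner>
  </updates>
  <data>
    <owner>[item]</owner>
  </data>
  <forwardClearJobs>true</forwardClearJobs>
  <domain/>
  <user/>
  <password/>
</connectorSource>
```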
File System Staging Repository Configuration
In addition to the common configuration, the File System Staging Repository Connector supports the properties described in this section relative to /doc/connectorSource of the AspireObject of the incoming Job.
Element | Type | Default | Description |
---|---|---|---|
compress | boolean | false | Set this to true to compress all data and metadata written to the store |
fileLock | boolean | false | When false, the File System staging repository will use in-memory locking to maintain consistency in the store. If you wish to use the staging repository across JVMs or hosts, set this to true to use file locking |
algorithm | string | AES | When a password is set in the common configuration, encrypt all data and metadata written to the store using the given password, algorithm and transformation |
transformation | string | AES | When a password is set in the common configuration, encrypt all data and metadata written to the store using the given password, algorithm and transformation |
Scanner Configuration Example
```xml
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0"
     jobNumber="1" normalizedCSName="StagingToEngine" scheduleId="0"
     scheduler="##AspireSystemScheduler##" sourceName="StagingToEngine">
  <connectorSource>
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <cfgUpdates>false</cfgUpdates>
    <cfgData>false</cfgData>
    <encrypt>false</encrypt>
    <password/>
    <algorithm/>
    <transformation/>
    <forwardClearJobs>true</forwardClearJobs>
    <fileLock>true</fileLock>
  </connectorSource>
  <displayName>StagingToEngine</displayName>
</doc>
```
Output
Output from the File System Staging Repository Connector is highly dependent on its configuration and the data stored in the repository. The data can consist of the stored item for a single data owner, or a merge of data from multiple data owners.
Single owner
```xml
<doc>
  <docType>item</docType>
  <url>c:\testdata\11\00\0\1.txt</url>
  <id>c:\testdata\11\00\0\1.txt</id>
  <fetchUrl>file:/c:/testdata/11/00/0/1.txt</fetchUrl>
  <displayUrl>c:\testdata\11\00\0\1.txt</displayUrl>
  <snapshotUrl>004 c:\testdata\11\00\0\1.txt</snapshotUrl>
  <repItemType>aspire/file</repItemType>
  <lastModified>2014-01-09T15:39:11Z</lastModified>
  <dataSize>4264</dataSize>
  <sourceName>FileToStaging</sourceName>
  <sourceType>filesystem</sourceType>
  <connectorSource type="filesystem">
    <url>c:\testdata\11</url>
    <partialScan>false</partialScan>
    <subDirUrl/>
    <indexContainers>false</indexContainers>
    <scanRecursively>true</scanRecursively>
    <useACLs>false</useACLs>
    <acls/>
    <scanExcludedItems>false</scanExcludedItems>
    <fileNamePatterns/>
    <displayName>FileToStaging</displayName>
  </connectorSource>
  <hierarchy>
    <item id="65616DDA92AAE39EF89209A1DC824E5B" level="4" name="1.txt" url="c:\testdata\11\00\0\1.txt">
      <ancestors>
        <ancestor id="A9633F7A8B463C5FB91EFC29D20B1C8C" level="3" name="0\" parent="true" type="aspire/folder" url="c:\testdata\11\00\0\"/>
        <ancestor id="E4FF3AB9206EFBD10F9BBB6378144D30" level="2" name="00\" type="aspire/folder" url="c:\testdata\11\00\"/>
        <ancestor id="C7D8344572B512D71B684BB6FD8EC267" level="1" name="FileToStaging" type="aspire/filesystem" url="c:\testdata\11\"/>
      </ancestors>
    </item>
  </hierarchy>
  <protocol source="FetchURLStage/protocol">file</protocol>
  <mimeType source="FetchURLStage/mimeType">text/plain</mimeType>
  <extension source="FetchURLStage">
    <field name="modificationDate">2014-01-09T15:39:11Z</field>
    <field name="content-type">text/plain</field>
    <field name="content-length">4264</field>
    <field name="last-modified">Thu, 09 Jan 2014 15:39:11 GMT</field>
  </extension>
  <connectorSource type="FileSystemStagingRepository">
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <cfgUpdates>false</cfgUpdates>
    <cfgData>false</cfgData>
    <encrypt>false</encrypt>
    <password/>
    <algorithm/>
    <transformation/>
    <forwardClearJobs>true</forwardClearJobs>
    <fileLock>true</fileLock>
    <displayName>StagingToEngine</displayName>
  </connectorSource>
  <action>add</action>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=windows-1252</contentType>
  <extension source="ExtractTextStage">
    <field name="Content-Encoding">windows-1252</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="resourceName">c:\testdata\11\00\0\1.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[A brutal blast of arctic air has settled .....]]></content>
</doc>
```
Multiple owner
When outputting data from multiple owners, each tag has an owner attribute added to indicate the source of the data:
```xml
<doc>
  <docType owner="default">item</docType>
  <url owner="default">c:\testdata\11\00\0\1.txt</url>
  <id owner="default">c:\testdata\11\00\0\1.txt</id>
  <fetchUrl owner="default">file:/c:/testdata/11/00/0/1.txt</fetchUrl>
  <displayUrl owner="default">c:\testdata\11\00\0\1.txt</displayUrl>
  <snapshotUrl owner="default">004 c:\testdata\11\00\0\1.txt</snapshotUrl>
  <repItemType owner="default">aspire/file</repItemType>
  <lastModified owner="default">2014-01-09T15:39:11Z</lastModified>
  <dataSize owner="default">4264</dataSize>
  <sourceName owner="default">FileToStaging</sourceName>
  <sourceType owner="default">filesystem</sourceType>
  <connectorSource owner="default" type="filesystem">
    <url>c:\testdata\11</url>
    <partialScan>false</partialScan>
    <subDirUrl/>
    <indexContainers>false</indexContainers>
    <scanRecursively>true</scanRecursively>
    <useACLs>false</useACLs>
    <acls/>
    <scanExcludedItems>false</scanExcludedItems>
    <fileNamePatterns/>
    <displayName>FileToStaging</displayName>
  </connectorSource>
  <hierarchy owner="default">
    <item id="65616DDA92AAE39EF89209A1DC824E5B" level="4" name="1.txt" url="c:\testdata\11\00\0\1.txt">
      <ancestors>
        <ancestor id="A9633F7A8B463C5FB91EFC29D20B1C8C" level="3" name="0\" parent="true" type="aspire/folder" url="c:\testdata\11\00\0\"/>
        <ancestor id="E4FF3AB9206EFBD10F9BBB6378144D30" level="2" name="00\" type="aspire/folder" url="c:\testdata\11\00\"/>
        <ancestor id="C7D8344572B512D71B684BB6FD8EC267" level="1" name="FileToStaging" type="aspire/filesystem" url="c:\testdata\11\"/>
      </ancestors>
    </item>
  </hierarchy>
  <protocol owner="default" source="FetchURLStage/protocol">file</protocol>
  <mimeType owner="default" source="FetchURLStage/mimeType">text/plain</mimeType>
  <extension owner="default" source="FetchURLStage">
    <field name="modificationDate">2014-01-09T15:39:11Z</field>
    <field name="content-type">text/plain</field>
    <field name="content-length">4264</field>
    <field name="last-modified">Thu, 09 Jan 2014 15:39:11 GMT</field>
  </extension>
  <connectorSource type="FileSystemStagingRepository">
    <url>/repo/SRDemo</url>
    <contentSource>FileToStaging</contentSource>
    <cfgUpdates>false</cfgUpdates>
    <cfgData>true</cfgData>
    <data>
      <owner>default</owner>
      <owner>bg</owner>
    </data>
    <encrypt>false</encrypt>
    <password/>
    <algorithm/>
    <transformation/>
    <forwardClearJobs>true</forwardClearJobs>
    <fileLock>true</fileLock>
    <displayName>StagingToEngine</displayName>
  </connectorSource>
  <action>add</action>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=windows-1252</contentType>
  <extension source="ExtractTextStage">
    <field name="Content-Encoding">windows-1252</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="resourceName">c:\testdata\11\00\0\1.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[A brutal blast of arctic air has settled ......]]></content>
</doc>
```