Configuration
This section lists the configuration parameters available for the Amazon S3 Scanner component.
General Scanner Component Configuration
Basic Scanner Configuration
Element | Type | Default | Description |
---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m | The max amount of time to wait before updating the statistics file. Whichever happens first between this property and maxOutstandingUpdatesStatistics will trigger an update to the statistics file. |
maxOutstandingUpdatesStatistics | long | 1000 | The max number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domains in the group expander). |
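
As a rough sketch of how the elements above might appear in a component definition, the fragment below sets each one to the default listed in the table. It assumes these elements are direct children of the component, as snapshotDir is in the configuration example on this page; the millisecond unit noted for waitForSubJobsTimeout is inferred from the "600000 (=10 mins)" default.

```xml
<component name="Scanner" subType="default" factoryName="aspire-s3-connector">
  <!-- Values below are the documented defaults; placement under <component> is an assumption -->
  <snapshotDir>snapshots</snapshotDir>
  <numOfSnapshotBackups>2</numOfSnapshotBackups>
  <!-- 600000 ms = 10 minutes (unit inferred from the documented default) -->
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
  <usesDomain>true</usesDomain>
  <!-- branches omitted here; see Branch Handler Configuration -->
</component>
```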
Branch Handler Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The max size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The max number of simultaneous batches that will be handled by the branch handler. |
Amazon S3 Specific Configuration
Element | Type | Default | Description |
---|
url | String | / | The URL at which to begin the crawl. |
accessKey | string | | The user access key to connect to Amazon S3, if one is not given in the control job. |
secretKey | string | | The user secret key to connect to Amazon S3, if one is not given in the control job. |
scanSubFolders | boolean | | Indicates whether the child containers should be scanned or not. |
indexFolders | boolean | | Indicates whether the container items (folders and buckets) should be indexed or not. |
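
The S3-specific elements might be combined with the component definition along these lines. This is a sketch only: it assumes the elements sit directly under the component, and myAccessKey and mySecretKey are placeholders, not real credentials.

```xml
<component name="Scanner" subType="default" factoryName="aspire-s3-connector">
  <!-- Start the crawl at the documented default URL -->
  <url>/</url>
  <!-- Placeholder credentials, used only when the control job supplies none -->
  <accessKey>myAccessKey</accessKey>
  <secretKey>mySecretKey</secretKey>
  <!-- Scan child containers and index folders and buckets as well as files -->
  <scanSubFolders>true</scanSubFolders>
  <indexFolders>true</indexFolders>
  <!-- branches omitted for brevity -->
</component>
```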
Configuration Example
<component name="Scanner" subType="default" factoryName="aspire-s3-connector">
  <debug>${debug}</debug>
  <snapshotDir>${snapshotDir}</snapshotDir>
  <branches>
    <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" allowRemote="true" batching="true"
            batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" allowRemote="true" batching="true"
            batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onDelete" pipelineManager="../ProcessPipelineManager" pipeline="deletePipeline" allowRemote="true" batching="true"
            batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
  </branches>
</component>
Source Configuration
Scanner Control Configuration
The following table lists the attributes that the AspireObject of the incoming scanner job requires to execute and control a scan.
Element | Type | Options | Description |
---|
@action | string | start, stop, pause, resume, abort | Control command that tells the scanner which operation to perform. Use the start option to launch a new crawl. |
@actionProperties | string | full, incremental | When @action is start, tells the scanner whether to run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector"
     scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
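For comparison, a control job requesting an incremental crawl might differ only in actionProperties. The sketch below reuses the attribute values from the example above and is illustrative rather than a complete job; the elided content is marked with comments.

```xml
<doc action="start" actionProperties="incremental" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector"
     scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  <!-- other job content elided -->
  <displayName>testSource</displayName>
  <!-- other job content elided -->
</doc>
```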
All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming Job.
Element | Type | Default | Description |
---|
url | string | | The URL to scan. |
accessKey | string | | The access key of the Amazon S3 account. |
secretKey | string | | The secret key of the Amazon S3 account. |
indexFolders | string | | true if folders (as well as files) should be indexed. |
scanSubFolders | string | | true if subfolders of the given URL should be scanned. |
fileNamePatterns/include/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added. |
fileNamePatterns/exclude/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added. |
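
To illustrate the include and exclude patterns, the fragment below sketches a fileNamePatterns block that might sit inside connectorSource. Both regular expressions are hypothetical: the first would include files whose names end in .pdf or .docx, and the second would exclude anything under a /tmp/ path. How the scanner combines include and exclude matches is not specified here, so treat the interaction between the two as an open question.

```xml
<fileNamePatterns>
  <!-- Hypothetical pattern: include PDF and Word documents -->
  <include pattern=".*\.(pdf|docx)$"/>
  <!-- Hypothetical pattern: exclude anything under a tmp folder -->
  <exclude pattern=".*/tmp/.*"/>
</fileNamePatterns>
```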
Scanner Configuration Example
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="1" jobNumber="0" normalizedCSName="amazonS3"
     scheduleId="1" scheduler="AspireScheduler" sourceName="amazonS3">
  <connectorSource>
    <url>/my-first-s3-bucket-1-0000000001/</url>
    <accessKey>myAccessKey</accessKey>
    <secretKey>mySecretKey</secretKey>
    <indexFolders>true</indexFolders>
    <scanSubFolders>true</scanSubFolders>
    <fileNamePatterns/>
  </connectorSource>
  <displayName>amazonS3</displayName>
</doc>
Output
<doc>
  <url>/my-first-s3-bucket-1-0000000001/</url>
  <id>/my-first-s3-bucket-1-0000000001/</id>
  <fetchUrl>/my-first-s3-bucket-1-0000000001/</fetchUrl>
  <repItemType>aspire/bucket</repItemType>
  <docType>container</docType>
  <snapshotUrl>/my-first-s3-bucket-1-0000000001/</snapshotUrl>
  <displayUrl/>
  <owner>andresau</owner>
  <lastModified>2013-11-06T03:51:55Z</lastModified>
  <acls>
    <acl access="allow" domain="my-first-s3-bucket-1-0000000001" entity="group" fullname="my-first-s3-bucket-1-0000000001\35b775a3073908cd529e174c2c59bd502c5eb986c5406029c2ced70b4e0ea4a7" name="35b775a3073908cd529e174c2c59bd502c5eb986c5406029c2ced70b4e0ea4a7" scope="global"/>
    <acl access="allow" domain="my-first-s3-bucket-1-0000000001" entity="group" fullname="my-first-s3-bucket-1-0000000001\41f7bbb1645b2b2a1d2134266f99695fc44e4735ca3725b457e373adcf31d9f0" name="41f7bbb1645b2b2a1d2134266f99695fc44e4735ca3725b457e373adcf31d9f0" scope="global"/>
  </acls>
  <sourceName>amazonS3</sourceName>
  <sourceType>s3</sourceType>
  <connectorSource>
    <url>/my-first-s3-bucket-1-0000000001/</url>
    <accessKey>AKIAIQRRCLVVIKV4ZYHQ</accessKey>
    <secretKey>encrypted:DB77B9869844B3651094CB293E842BD25E8F942CA94A09FE9D078A7AE762FB05F00A0929E7EDA0592CF73A47891DF3C3</secretKey>
    <indexFolders>true</indexFolders>
    <scanSubFolders>true</scanSubFolders>
    <fileNamePatterns/>
    <displayName>amazonS3</displayName>
  </connectorSource>
  <action>add</action>
  <hierarchy>
    <item id="2DEC5E0DACC737196DDA0C7ADA787EAF" level="2" name="my-first-s3-bucket-1-0000000001" type="aspire/bucket" url="/my-first-s3-bucket-1-0000000001/">
      <ancestors>
        <ancestor id="6666CD76F96956469E7BE39D750CC7D9" level="1" parent="true" type="aspire/server" url="/"/>
      </ancestors>
    </item>
    <itemType>container</itemType>
  </hierarchy>
  <content>my-first-s3-bucket-1-0000000001</content>
</doc>