Input Job Parameters
The scanner recognizes the following parameters on the control Job:
Element | Type | Description |
---|---|---|
id | int | The database id from the aspire_content_sources table. |
@crawlId | int | The crawl id from the content database. |
@action | String | The control job action - start, stop, pause or resume. |
@actionProperties | String | Properties for the crawl - full or incremental. |
connectorSource/displayName | String | The name of the crawl. |
connectorSource/Server | String | The URL to Box Server. |
connectorSource/ServerApi | String | The URL to Box API |
connectorSource/ApiVersion | String | The version of the Box API |
connectorSource/ClientId | string | The client id of an app in Box.com with access to the Box account. |
connectorSource/ClientSecret | string | The Client Secret of an app in Box.com with access to the Box account. |
connectorSource/RedirectUrl | string | A valid URL to which the authorization tokens are redirected during the authorization process. |
connectorSource/User | String | Login of the Box account. |
connectorSource/Password | string | Password for the Box account. |
connectorSource/PageSize | integer | The number of documents or folders returned by each API call. |
connectorSource/excludeExtensions | string | A comma-separated list of file extensions whose text should not be extracted, for instance dll or exe. |
connectorSource/useImpersonate | boolean | Impersonate each user of Box account in order to crawl all shared and private content. If unchecked, only shared content accessible by the crawling account will be crawled. |
connectorSource/indexContainers | boolean | Indicates if folders (as well as files) should be indexed. |
connectorSource/scanRecursively | boolean | Indicates if subfolders should be scanned. |
connectorSource/ExcludeSubFolders/include | string | Optional. A list of folders to exclude from the crawl. |
connectorSource/fileNamePatterns/include | regex | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added. |
connectorSource/fileNamePatterns/exclude | regex | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added. |
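For illustration, a control job carrying these parameters might look like the sketch below. The element and attribute names come from the table above, and the URLs, API version and page size are the defaults listed under Configuration. Everything else is an assumption: the credential values are placeholders, the `<doc>` root and the position of `<id>` are inferred from the header examples on this page, and the `pattern` attribute on `include` and `exclude` is assumed to follow the form used in the component configuration.

```xml
<!-- Hypothetical control job; all values are placeholders -->
<doc action="start" actionProperties="full" crawlId="0">
  <id>0</id>
  <connectorSource>
    <displayName>Box crawl</displayName>
    <Server>https://app.box.com</Server>
    <ServerApi>https://api.box.com</ServerApi>
    <ApiVersion>2.0</ApiVersion>
    <ClientId>YOUR_CLIENT_ID</ClientId>
    <ClientSecret>YOUR_CLIENT_SECRET</ClientSecret>
    <RedirectUrl>https://localhost:4000</RedirectUrl>
    <User>crawler@example.com</User>
    <Password>YOUR_PASSWORD</Password>
    <PageSize>100</PageSize>
    <!-- Skip text extraction for these extensions -->
    <excludeExtensions>dll,exe</excludeExtensions>
    <useImpersonate>false</useImpersonate>
    <indexContainers>true</indexContainers>
    <scanRecursively>true</scanRecursively>
    <fileNamePatterns>
      <!-- Assumed attribute form, mirroring the component configuration -->
      <include pattern=".*" />
      <exclude pattern=".*tmp$" />
    </fileNamePatterns>
  </connectorSource>
</doc>
```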
Configuration
The scanner recognizes the following configuration parameters:
Element | Type | Default | Description |
---|---|---|---|
Server | string | https://app.box.com | The Box server Url. |
Server API Url | string | https://api.box.com | URL for Box API. |
API version | string | 2.0 | The API version that Box is using. |
Client Id | string | none | The client id of an app in Box.com with access to the Box account. |
Client Secret | string | none | The Client Secret of an app in Box.com with access to the Box account. |
Redirect Url | string | https://localhost:4000 | A valid URL to which the authorization tokens are redirected during the authorization process. |
User | String | none | Login of the Box account. |
Password | string | none | Password for the Box account. |
PageSize | integer | 100 | The number of documents or folders returned by each API call. |
ExcludeExtensions | string | none | A comma-separated list of file extensions whose text should not be extracted, for instance dll or exe. |
useImpersonate | boolean | false | Impersonate each user of Box account in order to crawl all shared and private content. If unchecked, only shared content accessible by the crawling account will be crawled. |
ExcludeSubFolders/include | string | none | Optional. A list of folders to exclude from the crawl. |
fileNamePatterns/include | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added. |
fileNamePatterns/exclude | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added. |
indexContainers | boolean | false | Indicates if folders (as well as files) should be indexed. |
scanRecursively | boolean | false | Indicates if subfolders of the given URL should be scanned. |
Branch Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
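As a minimal sketch of the attributes above, the fragment below configures the three required events. The pipeline manager path and pipeline name are illustrative, borrowed from this page's example configuration. The onDelete branch omits `pipeline`, so according to the table it would publish to the default pipeline of the pipeline manager.

```xml
<!-- Illustrative branch configuration; names are assumptions -->
<branches>
  <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" />
  <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" />
  <!-- No pipeline attribute: falls back to the manager's default pipeline -->
  <branch event="onDelete" pipelineManager="../ProcessPipelineManager" />
</branches>
```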
Example Configuration
<component name="Scanner" subType="scanner" factoryName="aspire-box-scanner">
  <debug>${debug}</debug>
  <metadataMap>
    <map from="action" to="action" />
    <map from="doc-type" to="docType" />
    <map from="last-modified-date" to="lastModified" />
    <map from="content-length-bytes" to="dataSize" />
    <map from="owner" to="owner" />
  </metadataMap>
  <snapshotDir>${snapshotDir}</snapshotDir>
  <enableAuditing>${enableAuditing}</enableAuditing>
  <fileNamePatterns>
    <include pattern=".*" />
    <exclude pattern=".*tmp$" />
  </fileNamePatterns>
  <emitCrawlStartJobs>${emitStartJob}</emitCrawlStartJobs>
  <emitCrawlEndJobs>${emitEndJob}</emitCrawlEndJobs>
  <!-- Group cache -->
  <geCache lowercase="${geLowerCase}">
    <domain strip="${geStripDomain}" add="${geAddDomain}"/>
  </geCache>
  <branches>
    <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onDelete" pipelineManager="../ProcessPipelineManager" pipeline="deletePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onCrawlStart" pipelineManager="../ProcessPipelineManager" pipeline="crawlStartEndPipeline" allowRemote="true"/>
    <branch event="onCrawlEnd" pipelineManager="../ProcessPipelineManager" pipeline="crawlStartEndPipeline" allowRemote="true"/>
  </branches>
</component>
Source Configuration
Scanner Control Configuration
The following table describes the list of attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command to tell the scanner which operation to perform. Use start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, it will tell the scanner to either run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
Header Example
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector" scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
Header Example
<doc action="start" actionProperties="full" normalizedCSName="testSource1" scheduleId="1">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
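The other combinations from the table can be sketched the same way. The two headers below are hypothetical: the first starts an incremental crawl and the second pauses a running one. They reuse the placeholder source name from the example above, and whether a pause job needs attributes beyond `action` and `normalizedCSName` is an assumption, not something this page confirms.

```xml
<!-- Hypothetical: start an incremental crawl -->
<doc action="start" actionProperties="incremental" normalizedCSName="testSource1">
  <displayName>testSource</displayName>
</doc>

<!-- Hypothetical: pause the running crawl for the same content source -->
<doc action="pause" normalizedCSName="testSource1">
  <displayName>testSource</displayName>
</doc>
```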