Configuration
This section lists all configuration parameters available for the Jive connector's Scanner component.
General Scanner Component Configuration
Basic Scanner Configuration
Element | Type | Default | Description |
---|---|---|---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m | The maximum amount of time to wait before updating the statistics file. An update is triggered by this property or maxOutstandingUpdatesStatistics, whichever limit is reached first. |
maxOutstandingUpdatesStatistics | long | 1000 | The max number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domain in the group expander). |
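As a rough sketch, the basic scanner elements above might be set on the Scanner component as shown here. The placement as direct children of the component element is an assumption, based on where snapshotDir appears in the configuration example in this section, and the values are simply the documented defaults. A working component would also need the branches described in the next table.

```xml
<component name="Scanner" subType="default" factoryName="aspire-jive-connector">
  <!-- Assumed placement: direct children of the component, like snapshotDir -->
  <snapshotDir>snapshots</snapshotDir>
  <numOfSnapshotBackups>2</numOfSnapshotBackups>
  <!-- Documented default: 600000 (=10 mins) -->
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <!-- The statistics file is updated after 1m or after 1000 files, whichever happens first -->
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
  <usesDomain>true</usesDomain>
</component>
```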
Branch Handler Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The maximum size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The maximum number of simultaneous batches that will be handled by the branch handler. |
Jive Specific Configuration
Element | Type | Default | Description |
---|---|---|---|
timeOut | integer | 10000 | The maximum time the connection waits before timing out. |
maxRetries | integer | 5 | The maximum number of retries for a connection in case it fails. |
pageSize | integer | 100 | Number of elements that are going to be fetched per call. |
useSecurity | boolean | false | true if the Document Level Security must be fetched. |
useSecurityPlugin | boolean | false | true if the Security Groups ACLs must be fetched. |
useActivity | boolean | false | true if the activity incremental crawl is activated. |
security | boolean | false | true if the security must be fetched. |
incrementalCount | integer | 3 | Indicates how many activity incremental crawls must be performed before a normal incremental crawl is executed. |
expandSecurity | boolean | false | true if the Security Groups must be expanded in the Group Expansion. |
mapsDBDir | string | ${dist.data.dir}/${app.name}/mapsDB | The path where the map DBs will be stored. |
timestampDir | string | ${dist.data.dir}/${app.name}/timestamp | The path where the timestamp will be stored. |
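A minimal sketch of how the connector-specific elements above could be combined, again assuming they are direct children of the component element (this section does not state the placement explicitly). Values other than the three flags set to true are the documented defaults.

```xml
<component name="Scanner" subType="default" factoryName="aspire-jive-connector">
  <!-- Assumed placement: direct children of the component element -->
  <timeOut>10000</timeOut>
  <maxRetries>5</maxRetries>
  <pageSize>100</pageSize>
  <useSecurity>true</useSecurity>
  <useSecurityPlugin>true</useSecurityPlugin>
  <!-- With activity incrementals enabled and incrementalCount at 3, three activity
       incremental crawls are performed before a normal incremental is executed -->
  <useActivity>true</useActivity>
  <incrementalCount>3</incrementalCount>
  <mapsDBDir>${dist.data.dir}/${app.name}/mapsDB</mapsDBDir>
  <timestampDir>${dist.data.dir}/${app.name}/timestamp</timestampDir>
</component>
```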
Configuration Example
<component name="Scanner" subType="default" factoryName="aspire-jive-connector">
  <debug>true</debug>
  <snapshotDir>${aspire.home}/data/snapshots</snapshotDir>
  <fileNamePatterns>
    <include pattern=".*" />
    <exclude pattern=".*tmp$" />
  </fileNamePatterns>
  <branches>
    <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline"
      allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline"
      allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onDelete" pipelineManager="../ProcessPipelineManager" pipeline="deletePipeline"
      allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
  </branches>
</component>
Source Configuration
Scanner Control Configuration
The following table describes the list of attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command that tells the scanner which operation to perform. Use the start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, tells the scanner whether to run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
Header Example
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0"
  normalizedCSName="FeedOne_Connector" scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
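The control attributes in the table can be varied to request other operations. The two fragments below are sketches only: the first asks for an incremental rather than a full crawl, the second sends a stop command. Which additional attributes or child elements a control job needs is not stated in this section, so the minimal attribute sets shown here are an assumption.

```xml
<!-- Sketch: start an incremental crawl of the same content source -->
<doc action="start" actionProperties="incremental" normalizedCSName="FeedOne_Connector">
  <!-- connectorSource settings would go here -->
  <displayName>testSource</displayName>
</doc>
```

```xml
<!-- Sketch: stop the crawl for this content source (attribute set assumed) -->
<doc action="stop" normalizedCSName="FeedOne_Connector" />
```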
All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming Job.
Element | Type | Default | Description |
---|---|---|---|
url | string | none | The URL of the Jive Community. |
username | string | none | The name of the user that will be used for the crawl. |
password | string | none | The password of the user. |
useSecurity | boolean | false | true to fetch the document level security. |
useSecurityPlugin | boolean | false | true to fetch the security groups ACLs. |
pageSize | integer | 100 | Number of elements that are going to be fetched per call. |
timeOut | integer | 5 | Time in seconds before the connection times out. |
maxRetries | integer | 3 | Number of attempts before the connection reports an error. |
mapsDBDir | string | ${dist.data.dir}/${app.name}/mapsDB | Directory where the mapDBs for the ACLs and the Hierarchy will be placed. |
activityIncremental | boolean | false | true to use the activity incremental crawl. |
incrementalCount | integer | 5 | How many activity incremental crawls must be performed before a normal incremental crawl is run. |
timestampDir | string | ${dist.data.dir}/${app.name}/timestamp | Directory where timestamp will be placed. |
fileNamePatterns/include/@pattern | regex | none | Optional. A regular expression pattern to evaluate file urls against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added. |
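fileNamePatterns/exclude/@pattern | regex | none | Optional. A regular expression pattern to evaluate file urls against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added. |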
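Because multiple include nodes can be added, a source can list several patterns side by side. The patterns below are hypothetical and only illustrate the shape of the element. How the scanner resolves a URL that matches both an include and an exclude pattern is not described in this section.

```xml
<connectorSource>
  <!-- Other connectorSource settings omitted -->
  <fileNamePatterns>
    <!-- Hypothetical patterns, for illustration only -->
    <include pattern=".*/places/.*" />
    <include pattern=".*/contents/.*" />
    <exclude pattern=".*/people/.*" />
  </fileNamePatterns>
</connectorSource>
```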
Scanner Configuration Example
<doc action="start" actionProperties="full" normalizedCSName="testFile" scheduleId="1">
  <connectorSource>
    <url>http://searchtechnologies.jive.com</url>
    <username>Admin</username>
    <password>encrypted:63AA72A7708C8999DEE56A41894EBEEB</password>
    <pageSize>100</pageSize>
    <useSecurity>true</useSecurity>
    <useSecurityPlugin>true</useSecurityPlugin>
    <timeOut>5</timeOut>
    <maxRetries>3</maxRetries>
    <mapsDBDir>${dist.data.dir}/${app.name}/mapsDB</mapsDBDir>
    <activityIncremental>true</activityIncremental>
    <incrementalCount>5</incrementalCount>
    <timestampDir>${dist.data.dir}/${app.name}/timestamp</timestampDir>
    <fileNamePatterns>
      <include pattern=".*place.*"/>
      <exclude pattern=".*people.*"/>
    </fileNamePatterns>
  </connectorSource>
  <displayName>testFile</displayName>
</doc>
Output
<doc>
  <sourceType>jive</sourceType>
  <fetchUrl>http://searchtechnologies.jive.com/api/core/v3/places/1000</fetchUrl>
  <docType>container</docType>
  <lastModified>2013-04-22T13:56:11Z</lastModified>
  <dataSize>0</dataSize>
  <url>http://searchtechnologies.jive.com/api/core/v3/places/1000</url>
  <crawlId>43</crawlId>
  <id>1000</id>
  <displayUrl>http://searchtechnologies.jive.com/community/getting-started</displayUrl>
  <connectorSpecific type="jive">
    <field name="contentsUrl">http://searchtechnologies.jive.com/api/core/v3/contents?filter=place(http%3A%2F%2Fjive-search.com%3A8080%2Fapi%2Fcore%2Fv3%2Fplaces%2F1000)</field>
    <field name="contentsAllowed">GET</field>
    <field name="announcementsUrl">http://searchtechnologies.jive.com/api/core/v3/places/1000/announcements</field>
    <field name="announcementsAllowed">GET, POST</field>
    <field name="categoriesUrl">http://searchtechnologies.jive.com/api/core/v3/places/1000/categories</field>
    <field name="categoriesAllowed">GET, POST</field>
    <field name="htmlUrl">http://searchtechnologies.jive.com/community/getting-started</field>
    <field name="htmlAllowed">GET</field>
    <field name="selfUrl">http://searchtechnologies.jive.com/api/core/v3/places/1000</field>
    <field name="selfAllowed">DELETE, GET, PUT</field>
    <field name="placesUrl">http://searchtechnologies.jive.com/api/core/v3/places/1000/places</field>
    <field name="placesAllowed">GET</field>
    <field name="avatarUrl">http://searchtechnologies.jive.com/api/core/v3/places/1000/avatar</field>
    <field name="avatarAllowed">DELETE, GET, POST</field>
    <field name="followingInUrl">http://searchtechnologies.jive.com/api/core/v3/places/1000/followingIn</field>
    <field name="followingInAllowed">GET</field>
    <field name="activityUrl">http://searchtechnologies.jive.com/api/core/v3/places/1000/activities</field>
    <field name="activityAllowed">GET</field>
    <field name="staticsUrl">http://searchtechnologies.jive.com/api/core/v3/places/1000/statics</field>
    <field name="staticsAllowed">GET, POST</field>
    <field name="childCount">0</field>
    <field name="status">Active</field>
    <field name="locale">en_US</field>
    <field name="parent">http://searchtechnologies.jive.com/api/core/v3/places/1002</field>
    <field name="contentTypes">discussions, documents, files, polls</field>
    <field name="id">2001</field>
    <field name="visibleToExternalContributors">false</field>
    <field name="description">New to Jive SBS? Start here to learn how to get the most out of it.</field>
    <field name="name">Getting Started</field>
    <field name="followerCount">0</field>
    <field name="displayName">getting-started</field>
    <field name="published">2013-04-08T15:15:06Z</field>
    <field name="viewCount">0</field>
  </connectorSpecific>
  <repItemType>aspire/space</repItemType>
  <sourceName>Jive Connecotr with Plugin</sourceName>
  <snapshotUrl>002 http://searchtechnologies.jive.com/api/core/v3/places/1000</snapshotUrl>
  <action>add</action>
  <acls>
    <acl name="All Registered Users" sidType="4" scope="global" entity="group" sid="-2" access="allow"/>
    <acl name="All Guest Users" sidType="4" scope="global" entity="group" sid="-3" access="allow"/>
  </acls>
  <hierarchy>
    <item name="Getting Started" type="space" url="http://searchtechnologies.jive.com/api/core/v3/places/1000" id="1000" level="2">
      <ancestors>
        <ancestor name="Jive" parent="true" type="space" url="http://searchtechnologies.jive.com/api/core/v3/places/1002" id="1002" level="1"/>
      </ancestors>
    </item>
  </hierarchy>
  <connectorSource>
    <url>http://searchtechnologies.jive.com</url>
    <username>Admin</username>
    <password>encrypted:63AA72A7708C8999DEE56A41894EBEEB</password>
    <pageSize>100</pageSize>
    <useSecurity>true</useSecurity>
    <useSecurityPlugin>true</useSecurityPlugin>
    <timeOut>5</timeOut>
    <maxRetries>3</maxRetries>
    <mapsDBDir>${dist.data.dir}/${app.name}/mapsDB</mapsDBDir>
    <activityIncremental>true</activityIncremental>
    <incrementalCount>5</incrementalCount>
    <timestampDir>${dist.data.dir}/${app.name}/timestamp</timestampDir>
    <fileNamePatterns>
      <include pattern=".*place.*"/>
      <exclude pattern=".*people.*"/>
    </fileNamePatterns>
  </connectorSource>
</doc>