Configuration
This section lists all parameters available to configure the Atlassian Confluence Scanner component.
General Scanner Component Configuration
Basic Scanner Configuration
Element | Type | Default | Description |
---|---|---|---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m | The maximum amount of time to wait before updating the statistics file. Whichever happens first between this property and maxOutstandingUpdatesStatistics will trigger an update to the statistics file. |
maxOutstandingUpdatesStatistics | long | 1000 | The max number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domain in the group expander). |
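As a rough illustration, the elements above could be set as follows, shown here with the default values from the table. This is only a sketch: it assumes these elements are written as direct children of the scanner component, in the same way snapshotDir and numOfSnapshotBackups appear in the configuration example on this page. The enclosing component element is omitted.

```xml
<!-- Assumed placement: direct children of the scanner <component> element -->
<snapshotDir>snapshots</snapshotDir>
<numOfSnapshotBackups>2</numOfSnapshotBackups>
<!-- 600000 = 10 minutes -->
<waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
<!-- The statistics file is updated after 1m or after 1000 files, whichever happens first -->
<maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
<maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
<usesDomain>true</usesDomain>
```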
Branch Handler Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The maximum size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The maximum number of simultaneous batches that will be handled by the branch handler. |
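A minimal sketch of a branch configuration without batching might look like the following. The pipeline manager and pipeline names are placeholders, and the sketch assumes that the optional attributes (allowRemote and the batching attributes) can simply be left out; the table above only states this explicitly for the pipeline attribute.

```xml
<branches>
  <!-- One branch per event; the pipeline manager and pipeline names are placeholders -->
  <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" />
  <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" />
  <!-- No pipeline attribute: publishes to the default pipeline of the pipeline manager -->
  <branch event="onDelete" pipelineManager="../ProcessPipelineManager" />
</branches>
```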
Atlassian Confluence Specific Configuration
Element | Type | Default | Description |
---|---|---|---|
confluenceVersion | string | V3.5 | Indicates the Confluence version that will be crawled. |
cacheTimeout | string | 180000 | How long the cache is kept alive. |
numOfSnapshotBackups | string | 10 | Number of snapshots that will be stored as backups. |
groupPrefixSeparator | string | \| | Prefix used to separate users and groups in the ACL file. |
confluencePublicAcl | string | confluence-public-acl | ACL used to indicate that a page has public access. |
defaultConfluenceUrl | string | | The Confluence instance to crawl. |
defaultDomain | string | | The domain used to connect to Confluence. |
defaultUsername | string | | The username used to connect to Confluence. |
defaultPassword | string | | The password used to connect to Confluence. |
ssoAuthentication | boolean | false | Check this if your Confluence authentication is managed by a Single Sign-On engine that uses a cookie-based authentication mechanism. |
ssoServer | string | | The full URL to which the authentication request is sent. |
ssoCookie | string | | The name of the authentication cookie the system should look for. |
useGE | boolean | false | Check this if you want to use group expansion. |
useLDAPCache | boolean | false | Check this if you want to use an external server to perform group expansion (Group Expansion Manager is required). |
externalGroupServerPath | select | | Indicates the path of the component that gets the external groups from LDAP. |
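For instance, a cookie-based Single Sign-On setup could be sketched as below. The server URL and cookie name are hypothetical placeholders, and the sketch assumes these elements sit alongside the other scanner settings inside the component element.

```xml
<!-- Hypothetical values: replace the URL and cookie name with those of your SSO engine -->
<ssoAuthentication>true</ssoAuthentication>
<ssoServer>https://sso.example.com/login</ssoServer>
<ssoCookie>EXAMPLE_SSO_COOKIE</ssoCookie>
```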
Configuration Example
<component name="Scanner" subType="${confluenceVersion}" factoryName="aspire-confluence-connector">
  <debug>true</debug>
  <confluencePublicAcl>confluence-public-acl</confluencePublicAcl>
  <confluenceVersion>v5</confluenceVersion>
  <snapshotDir>${dist.data.dir}/${app.name}/snapshots</snapshotDir>
  <groupPrefixSeparator>|</groupPrefixSeparator>
  <pluginEnabled>false</pluginEnabled>
  <cacheTimeout>180000</cacheTimeout>
  <useLDAPCache>false</useLDAPCache>
  <externalGroupServerPath></externalGroupServerPath>
  <includeAttachments>true</includeAttachments>
  <includeComments>true</includeComments>
  <anonymousAccessAllowed>false</anonymousAccessAllowed>
  <numOfSnapshotBackups>10</numOfSnapshotBackups>
  <branches>
    <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline"
            allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline"
            allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onDelete" pipelineManager="../ProcessPipelineManager" pipeline="deletePipeline"
            allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
  </branches>
</component>
Source Configuration
Scanner Control Configuration
The following table describes the list of attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command that tells the scanner which operation to perform. Use the start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, tells the scanner whether to run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
Header Example
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0"
     normalizedCSName="FeedOne_Connector" scheduleId="0" scheduler="##AspireSystemScheduler##"
     sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
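Based on the options in the control table, other control jobs can be sketched by changing the action and actionProperties attributes. The two jobs below are illustrative only: they reuse the content source name from the header example, leave out the connector source settings, and assume that a stop job needs no attributes beyond the action and the content source name.

```xml
<!-- Job 1: start an incremental crawl (connector source settings omitted here) -->
<doc action="start" actionProperties="incremental" normalizedCSName="FeedOne_Connector">
  <displayName>testSource</displayName>
</doc>

<!-- Job 2, sent separately: stop the crawl for the same content source -->
<doc action="stop" normalizedCSName="FeedOne_Connector">
  <displayName>testSource</displayName>
</doc>
```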
All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming Job.
Element | Type | Default | Description |
---|---|---|---|
url | string | none | The URL of the Confluence server. |
domain | string | none | The domain used to connect to the Confluence server. |
username | string | none | The name of the user that will be used for the crawl. |
password | string | none | The password of the user. |
pluginEnabled | boolean | | Indicates if the plugin is installed in the Confluence server. |
includeAttachments | boolean | | Indicates if attachments should be included in the crawl. |
includeComments | boolean | | Indicates if comments should be included in the crawl. |
anonymousAccessAllowed | boolean | | Indicates if anonymous access is allowed on this instance of Confluence. |
indexContainers | boolean | | true if folders (as well as files) should be indexed. |
scanRecursively | boolean | | Indicates whether the child containers should be scanned or not. |
fileNamePatterns/include/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added. |
fileNamePatterns/exclude/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added. |
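Because multiple include and exclude nodes can be added, a filter with several patterns might be sketched as follows. The patterns are hypothetical. How the scanner combines several include nodes, and whether an exclude takes precedence over an include, is not stated here, so treat that behavior as an assumption to verify against your installation.

```xml
<fileNamePatterns>
  <!-- Hypothetical patterns: URLs containing "docs" or ending in ".pdf" -->
  <include pattern=".*docs.*"/>
  <include pattern=".*\.pdf$"/>
  <!-- Hypothetical pattern: URLs containing "archive" -->
  <exclude pattern=".*archive.*"/>
</fileNamePatterns>
```

Each pattern is wrapped in `.*`, as in the scanner configuration example, so it behaves the same whether the scanner requires the pattern to match the whole URL or only part of it.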
Scanner Configuration Example
<doc action="start" actionProperties="full" normalizedCSName="confluence">
  <connectorSource>
    <url>http://myConfluenceServer:8090/</url>
    <domain/>
    <username>admin</username>
    <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
    <pluginEnabled>false</pluginEnabled>
    <includeAttachments>true</includeAttachments>
    <includeComments>true</includeComments>
    <anonymousAccessAllowed>false</anonymousAccessAllowed>
    <indexContainers>true</indexContainers>
    <scanRecursively>true</scanRecursively>
    <fileNamePatterns>
      <include pattern=".*place.*"/>
      <exclude pattern=".*people.*"/>
    </fileNamePatterns>
  </connectorSource>
  <displayName>confluence</displayName>
</doc>
Note: To launch a crawl, the job should be sent (processed/enqueued) to the "/ConfluenceConnector/Main" pipeline.