Configuration
This section lists all configuration parameters for the SharePoint Online Scanner component.
General Scanner Component Configuration
Basic Scanner Configuration
Element | Type | Default | Description |
---|---|---|---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m | The max amount of time to wait before updating the statistics file. Whichever happens first between this property and maxOutstandingUpdatesStatistics will trigger an update to the statistics file. |
maxOutstandingUpdatesStatistics | long | 1000 | The max number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domain in the group expander). |
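As a rough illustration, the basic scanner elements from the table above could be set as follows. This is a sketch only: it assumes these elements are direct children of the scanner's component element, in the same way as the elements in the configuration example on this page, and the values are simply the defaults listed in the table.

```xml
<component name="Scanner" subType="default" factoryName="aspire-sharepointonline-scanner">
  <!-- Assumed placement: direct children of the component element -->
  <snapshotDir>snapshots</snapshotDir>
  <numOfSnapshotBackups>2</numOfSnapshotBackups>
  <!-- 600000 = 10 minutes, as stated in the table -->
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <!-- The statistics file is updated when either limit is reached first -->
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
  <usesDomain>true</usesDomain>
</component>
```

With these values, the statistics file would be updated after 1000 processed files or after the configured time limit, whichever comes first.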
Branch Handler Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The max size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The max number of simultaneous batches that will be handled by the branch handler. |
SharePoint Online Specific Configuration
Element | Type | Default | Description |
---|---|---|---|
userName | String | username | The user name to connect to SharePoint with, if one is not given in the control job. |
password | String | secretpassword | The password to connect to SharePoint with, if one is not given in the control job. |
defaultDisplayName | String | SharePointOnline | The name of the crawl, if one is not given in the control job. |
groupPrefixSeparator | String | \| | The separator inserted between the site URL and group name when extracting groups from sites. |
snapshotDir | String | . | The directory for snapshot files. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
scanRecursively | boolean | false | Indicates whether the child containers should be scanned or not. |
indexContainers | boolean | false | Indicates whether the container items should be indexed or not. |
crawlAttachments | boolean | false | Crawl attachments from list items. E.g. documents attached to an Event. |
crawlExtraSiteCollections | boolean | false | Indicates if the user will crawl more than one site collection. |
subSiteCollections/siteCollectionUrl | string | empty | List of sub site collections to crawl. More than one allowed. |
useLDAPCache | boolean | false | Check for an installed "Aspire LDAP Cache" component for group expansion. |
externalGroupServerPath | string | empty | List of installed "Aspire LDAP Cache" components. |
Configuration Example
<component name="Scanner" subType="default" factoryName="aspire-sharepointonline-scanner">
  <debug>${debug}</debug>
  <groupPrefixSeparator>${groupPrefixSeparator}</groupPrefixSeparator>
  <snapshotDir>${snapshotDir}</snapshotDir>
  <scanRecursively>${scanRecursively}</scanRecursively>
  <indexContainers>${indexContainers}</indexContainers>
  <crawlAttachments>${crawlAttachments}</crawlAttachments>
  <useLDAPCache>${useLDAPCache}</useLDAPCache>
  <externalGroupServerPath>${externalGroupServerPath}</externalGroupServerPath>
  <crawlExtraSiteCollections>${crawlExtraSiteCollections}</crawlExtraSiteCollections>
  <subSiteCollections>
    <siteCollectionUrl>${siteCollectionUrl}</siteCollectionUrl>
  </subSiteCollections>
  <branches>
    <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onDelete" pipelineManager="../ProcessPipelineManager" pipeline="deletePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
  </branches>
</component>
Source Configuration
Scanner Control Configuration
The following table describes the list of attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command to tell the scanner which operation to perform. Use the start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, it will tell the scanner to either run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
Header Example
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector" scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
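For comparison, a control job requesting an incremental crawl might look like the sketch below. It uses only the attributes described in the table above; a real job would most likely also carry the remaining header attributes and the connectorSource section, which are omitted here. The source name is reused from the header example.

```xml
<!-- Sketch: start an incremental crawl instead of a full one -->
<doc action="start" actionProperties="incremental" normalizedCSName="FeedOne_Connector">
  <displayName>testSource</displayName>
</doc>
```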
All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming Job.
Element | Type | Default | Description |
---|---|---|---|
url | string | | The URL to scan (http or https). |
username | string | | The username to connect to SharePoint with. |
password | string | | The password to connect to SharePoint with. |
indexContainers | boolean | false | true if folders (as well as files) should be indexed. |
scanRecursively | boolean | false | true if subfolders of the given URL should be scanned. |
indexContainers | boolean | false | Indicates whether the container items should be indexed or not. |
crawlAttachments | boolean | false | Crawl attachments from list items. E.g. documents attached to an Event. |
crawlExtraSiteCollections | boolean | false | Indicates if the user will crawl more than one site collection. |
subSiteCollections/siteCollectionUrl | string | empty | List of sub site collections to crawl. More than one allowed. |
fileNamePatterns/include/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added. |
fileNamePatterns/exclude/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added. |
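The include and exclude nodes described above could be combined as in the following sketch. The element layout follows the paths given in the table; the URL and both patterns are hypothetical placeholders chosen for illustration. How the scanner resolves a file that matches both an include and an exclude pattern is not stated here, so it is best to keep the two sets of patterns disjoint.

```xml
<connectorSource>
  <!-- Hypothetical site URL -->
  <url>https://example.sharepoint.com/sites/docs</url>
  <fileNamePatterns>
    <!-- Hypothetical pattern: include only .docx and .pdf files -->
    <include pattern=".*\.(docx|pdf)$"/>
    <!-- Hypothetical pattern: exclude anything under a drafts folder -->
    <exclude pattern=".*/drafts/.*"/>
  </fileNamePatterns>
</connectorSource>
```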
Content Source Configuration Example
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="2" jobNumber="5" normalizedCSName="SharePointOnline" scheduleId="2" scheduler="AspireScheduler" sourceName="SharePointOnline">
  <connectorSource>
    <url>http://10.10.21.127/sites/aspire</url>
    <crawlExtraSiteCollections>true</crawlExtraSiteCollections>
    <subSiteCollections>
      <siteCollectionUrl>http://10.10.21.127/sites/lasith</siteCollectionUrl>
    </subSiteCollections>
    <domain>qa</domain>
    <username>sp_farm</username>
    <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
    <scanRecursively>true</scanRecursively>
    <indexContainers>true</indexContainers>
    <crawlAttachments>true</crawlAttachments>
    <fileNamePatterns/>
  </connectorSource>
  <displayName>SharePointOnline</displayName>
</doc>
Output
<doc>
  <url>https://coreteamdev.sharepoint.com/_api/Web</url>
  <snapshotUrl>001 https://coreteamdev.sharepoint.com/_api/Web</snapshotUrl>
  <repItemType>aspire/sharePoint</repItemType>
  <docType>container</docType>
  <sourceName>o365_SP</sourceName>
  <sourceType>spOnline</sourceType>
  <GUID>e8f9fe13-9c6f-443f-8d2e-d28c78e4617e</GUID>
  <description/>
  <title>Search Technologies Team Site</title>
  <lastModified>2015-01-30T17:03:42Z</lastModified>
  <dataSize>0</dataSize>
  <displayUrl>https://coreteamdev.sharepoint.com</displayUrl>
  <id>https://coreteamdev.sharepoint.com/_api/Web</id>
  <fetchUrl>https://coreteamdev.sharepoint.com</fetchUrl>
  <connectorSpecific type="spOnline">
    <field name="AllowRssFeeds">true</field>
    <field name="AppInstanceId">00000000-0000-0000-0000-000000000000</field>
    <field name="Configuration">0</field>
    <field name="Created">2015-01-13T19:52:07.957</field>
    <field name="CustomMasterUrl">/_catalogs/masterpage/seattle.master</field>
    <field name="DocumentLibraryCalloutOfficeWebAppPreviewersDisabled">false</field>
    <field name="EnableMinimalDownload">true</field>
    <field name="Id">e8f9fe13-9c6f-443f-8d2e-d28c78e4617e</field>
    <field name="Language">1033</field>
    <field name="LastItemModifiedDate">2015-01-30T17:03:42Z</field>
    <field name="MasterUrl">/_catalogs/masterpage/seattle.master</field>
    <field name="QuickLaunchEnabled">true</field>
    <field name="RecycleBinEnabled">true</field>
    <field name="ServerRelativeUrl">/</field>
    <field name="SyndicationEnabled">true</field>
    <field name="Title">Search Technologies Team Site</field>
    <field name="TreeViewEnabled">false</field>
    <field name="UIVersion">15</field>
    <field name="UIVersionConfigurationEnabled">false</field>
    <field name="Url">https://coreteamdev.sharepoint.com</field>
    <field name="WebTemplate">STS</field>
  </connectorSpecific>
  <acls>
    <acl Permissions="Read, " access="allow" domain="" entity="group" fullname="C49173E3275E346A38FCF84708A93EE7|Team Site Visitors" name="Team Site Visitors" scope="machine"/>
    <acl Permissions="Full Control, " access="allow" domain="" entity="group" fullname="C49173E3275E346A38FCF84708A93EE7|Team Site Owners" name="Team Site Owners" scope="machine"/>
    <acl Permissions="Edit, " access="allow" domain="" entity="group" fullname="C49173E3275E346A38FCF84708A93EE7|Team Site Members" name="Team Site Members" scope="machine"/>
    <acl Permissions="Read, " access="allow" domain="" entity="user" fullname="[email protected]" name="Julian Ramirez" scope="global"/>
    <acl Permissions="View Only, " access="deny" domain="" entity="group" fullname="C49173E3275E346A38FCF84708A93EE7|Excel Services Viewers" name="Excel Services Viewers" scope="machine"/>
  </acls>
  <hierarchy>
    <item id="DC660F50ED76AC04EB3E83BB2F674187" level="1" name="Search Technologies Team Site" type="aspire/sharePoint" url="https://coreteamdev.sharepoint.com"/>
  </hierarchy>
  <connectorSource type="spOnline">
    <url>https://coreteamdev.sharepoint.com</url>
    <crawlExtraSiteCollections>false</crawlExtraSiteCollections>
    <subSiteCollections/>
    <username>[email protected]</username>
    <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
    <scanRecursively>true</scanRecursively>
    <indexContainers>true</indexContainers>
    <crawlAttachments>true</crawlAttachments>
    <scanExcludedItems>false</scanExcludedItems>
    <requestProperties/>
    <fileNamePatterns/>
    <displayName>o365_SP</displayName>
  </connectorSource>
  <action>add</action>
  <content/>
</doc>