The Documentum Scanner component performs full and incremental scans over a Documentum repository. It maintains a snapshot of the repository and compares it with the current content to establish which content has been updated. Updated content is then submitted to the configured pipeline in AspireObjects attached to Jobs. In addition to the URL of the changed item, the AspireObject also contains metadata extracted from the repository. Updated content is split into three types: add, update and delete. Each type of content is published on a different event so that it can be handled by different Aspire pipelines.
The scanner reacts to an incoming job. This job may instruct the scanner to start, stop, pause or resume. Typically, the start job contains all of the information the scanner requires to perform the crawl. However, the scanner can also be configured with default values via the application.xml file. When pausing or stopping, the scanner waits until all the jobs it published have completed before it completes the operation.
Documentum Scanner | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-documentum-connector |
subType | default |
Inputs | AspireObject from a content source submitter holding all the information required for a crawl |
Outputs | Jobs from the crawl |
This section lists all configuration parameters available to configure the Documentum Scanner component.
Element | Type | Default | Description |
---|---|---|---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m | The maximum amount of time to wait before updating the statistics file. Whichever happens first between this property and maxOutstandingUpdatesStatistics will trigger an update to the statistics file. |
maxOutstandingUpdatesStatistics | long | 1000 | The max number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domain in the group expander). |
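As an illustration of how the general parameters above might appear in application.xml, the following sketch sets each one explicitly inside the scanner component. Placing them as direct child elements of the component is an assumption, based on where snapshotDir appears in the component example on this page. The values are the documented defaults, apart from the snapshot directory.

```xml
<component name="scanner" factoryName="aspire-documentum-connector" subType="default">
  <!-- Snapshot storage: where snapshots live and how many old ones to keep -->
  <snapshotDir>${aspire.home}/data/snapshots</snapshotDir>
  <numOfSnapshotBackups>2</numOfSnapshotBackups>

  <!-- Wait up to 10 minutes (value in milliseconds) for published jobs to complete -->
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>

  <!-- Statistics file is updated after 1 minute or 1000 files, whichever comes first -->
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>

  <!-- Use the domain\user format in group expansion requests -->
  <usesDomain>true</usesDomain>

  <!-- Branches and Documentum-specific settings are omitted from this sketch -->
</component>
```

Since these are the defaults, omitting an element should behave the same as listing it. Setting them explicitly is mainly useful as a starting point for tuning.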
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The maximum size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The maximum number of simultaneous batches that will be handled by the branch handler. |
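A minimal branch configuration without batching could look like the sketch below. It assumes that the allowRemote and batching attributes are optional and can simply be left out, which the table above does not state explicitly. The pipeline manager and pipeline names are illustrative.

```xml
<branches>
  <!-- No pipeline attribute: per the table, jobs go to the pipeline manager's default pipeline -->
  <branch event="onAdd" pipelineManager="../ProcessPipelineManager" />
  <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" />

  <!-- Deletes are routed to a dedicated pipeline (name is illustrative) -->
  <branch event="onDelete" pipelineManager="../ProcessPipelineManager" pipeline="deletePipeline" />
</branches>
```

All three events still have a branch, as the component requires. Only the routing differs between them.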
The scanner recognizes the following configuration parameters:
Element | Type | Default | Description |
---|---|---|---|
url | String | | The URL to crawl. |
username | String | | The username to use when accessing Documentum. |
password | String | | The password to use when accessing Documentum. |
dfcPropsFilePath | String | | The location of the DFC properties file. You must also copy the dfc.keystore file to the location specified in the dfc.properties file. |
webtopUrl | String | | The URL to access the Webtop interface. This string is prefixed to each object path so it can be accessed through a URL. |
maxFileSize | int | unlimited | The maximum size, in MB, of the content to be crawled, or unlimited to extract the whole content. |
usePrefix | boolean | false | When doing group expansion, the component will return the groups with a predefined prefix in the form of PREFIX@group. |
scanSystemCabinets | boolean | false | true if hidden and private cabinets of Documentum should be scanned. |
<component name="scanner" factoryName="aspire-documentum-connector" subType="default">
  <username>admin</username>
  <password>admin</password>
  <dfcPropsFilePath>C:/Documentum/config/dfc.properties</dfcPropsFilePath>
  <webtopUrl>http://localhost:9080/webtop</webtopUrl>
  <debug>true</debug>
  <snapshotDir>${aspire.home}/data/snapshots</snapshotDir>
  <branches>
    <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="addUpdatePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
    <branch event="onDelete" pipelineManager="../ProcessPipelineManager" pipeline="deletePipeline" allowRemote="true" batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2" />
  </branches>
</component>
The following table describes the list of attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command that tells the scanner which operation to perform. Use the start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, tells the scanner to run either a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector" scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
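The other control commands follow the same shape. The two jobs below are a sketch of an incremental start and a pause for the same content source, shown as two separate documents. They assume that the scheduler-related attributes of the full example (crawlId, scheduleId and so on) are not needed for a manually submitted job, and that the start job can rely on default connection values configured in application.xml.

```xml
<!-- Job 1: start an incremental crawl, which compares against the existing snapshot -->
<doc action="start" actionProperties="incremental" normalizedCSName="FeedOne_Connector">
  <!-- Connection settings are assumed to come from the defaults in application.xml -->
  <displayName>testSource</displayName>
</doc>

<!-- Job 2: pause the running crawl; the scanner waits for its published jobs to complete -->
<doc action="pause" normalizedCSName="FeedOne_Connector">
  <displayName>testSource</displayName>
</doc>
```

The normalizedCSName value identifies the content source, so it should be the same across the start, pause, resume and stop jobs of one crawl.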
All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming Job.
Element | Type | Default | Description |
---|---|---|---|
url | string | | The Documentum URL to scan. The format is dctm://server:port/docbase/cabinet/folder, where the port is optional and the URL must include at least the docbase. |
username | string | | The username to use when connecting to Documentum. |
password | string | | The password to use when connecting to Documentum. |
indexContainers | boolean | false | true if folders (as well as files) should be sent to the pipeline. |
scanRecursively | boolean | false | true if sub folders of the given URL should be scanned. |
scanSystemCabinets | boolean | false | true if private and system cabinets should be scanned. |
maxFileSize | long | | The maximum size, in MB, of the content to be crawled, or unlimited if the whole file should be extracted. |
webtopUrl | string | | The URL to access the Webtop interface. This will be prefixed to each object path so it can be accessed through a URL. |
dfcPropsFilePath | string | | The location of the DFC properties file. |
<doc action="start" actionProperties="full" normalizedCSName="cifsTest">
  <connectorSource>
    <url>dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB</url>
    <username>Administrator</username>
    <password>pass1234</password>
    <dfcPropsFilePath>config/dfc.properties</dfcPropsFilePath>
    <webtopUrl>http://10.10.21.73:9080/webtop/objectId=</webtopUrl>
    <maxFileSize>Unlimited</maxFileSize>
    <indexContainers>true</indexContainers>
    <scanRecursively>true</scanRecursively>
    <scanSystemCabinets>false</scanSystemCabinets>
    <fileNamePatterns>
      <include pattern=".*LSA.*"/>
      <exclude pattern=".*tmp.*"/>
    </fileNamePatterns>
  </connectorSource>
  <displayName>documentum</displayName>
</doc>
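Because the port is optional and the URL only has to reach the docbase, a whole-repository crawl can use a much shorter URL. The following sketch shows that form. The host name, content source name and credentials are hypothetical placeholders, and leaving out the elements that have documented defaults is an assumption.

```xml
<doc action="start" actionProperties="full" normalizedCSName="documentum_whole_docbase">
  <connectorSource>
    <!-- Shortest URL the format allows: no port, path stops at the docbase -->
    <url>dctm://dctm-host.example.com/DocumentumRepository</url>
    <username>crawl_user</username>
    <password>changeit</password>
    <dfcPropsFilePath>config/dfc.properties</dfcPropsFilePath>
    <webtopUrl>http://dctm-host.example.com:9080/webtop/objectId=</webtopUrl>
    <!-- Descend from the docbase into its cabinets and folders -->
    <scanRecursively>true</scanRecursively>
  </connectorSource>
  <displayName>documentum</displayName>
</doc>
```

With scanRecursively left at its default of false, only the level named by the URL would be scanned, so a docbase-level URL is usually paired with recursive scanning.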
<doc>
  <url>dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB/folder-1/folder-1-3/dm_document-0024.txt</url>
  <fetchUrl>dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB/folder-1/folder-1-3/dm_document-0024.txt</fetchUrl>
  <snapshotUrl>006 dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB/folder-1/folder-1-3/dm_document-0024.txt</snapshotUrl>
  <docType>item</docType>
  <id>090010e18001d15c</id>
  <connectorSpecific type="documentum">
    <field name="object_name">dm_document-0024.txt</field>
    <field name="r_object_type">dm_document</field>
    <field name="r_creation_date">12/5/2013 2:13:59 PM</field>
    <field name="r_modify_date">12/5/2013 2:13:59 PM</field>
    <field name="r_modifier">Administrator</field>
    <field name="r_access_date">1/22/2014 2:42:11 PM</field>
    <field name="a_is_hidden">F</field>
    <field name="i_is_deleted">F</field>
    <field name="a_retention_date">nulldate</field>
    <field name="a_archive">F</field>
    <field name="a_link_resolved">F</field>
    <field name="i_reference_cnt">1</field>
    <field name="i_has_folder">T</field>
    <field name="i_folder_id">0b0010e18001d14b</field>
    <field name="r_link_cnt">0</field>
    <field name="r_link_high_cnt">0</field>
    <field name="r_assembled_from_id">0000000000000000</field>
    <field name="r_frzn_assembly_cnt">0</field>
    <field name="r_has_frzn_assembly">F</field>
    <field name="r_is_virtual_doc">0</field>
    <field name="i_contents_id">060010e18001c31b</field>
    <field name="a_content_type">crtext</field>
    <field name="r_page_cnt">1</field>
    <field name="r_content_size">511425</field>
    <field name="a_full_text">T</field>
    <field name="a_storage_type">filestore_01</field>
    <field name="i_cabinet_id">0c0010e1800175ca</field>
    <field name="owner_name">Administrator</field>
    <field name="owner_permit">7</field>
    <field name="group_name">docu</field>
    <field name="group_permit">5</field>
    <field name="world_permit">3</field>
    <field name="i_antecedent_id">0000000000000000</field>
    <field name="i_chronicle_id">090010e18001d15c</field>
    <field name="i_latest_flag">T</field>
    <field name="r_lock_date">nulldate</field>
    <field name="r_version_label">1.0,CURRENT</field>
    <field name="i_branch_cnt">0</field>
    <field name="i_direct_dsc">F</field>
    <field name="r_immutable_flag">F</field>
    <field name="r_frozen_flag">F</field>
    <field name="r_has_events">F</field>
    <field name="acl_domain">Administrator</field>
    <field name="acl_name">dm_450010e180000101</field>
    <field name="i_is_reference">F</field>
    <field name="r_creator_name">Administrator</field>
    <field name="r_is_public">T</field>
    <field name="r_policy_id">0000000000000000</field>
    <field name="r_resume_state">0</field>
    <field name="r_current_state">0</field>
    <field name="r_alias_set_id">0000000000000000</field>
    <field name="a_is_template">F</field>
    <field name="r_full_content_size">511425</field>
    <field name="a_is_signed">F</field>
    <field name="a_last_review_date">nulldate</field>
    <field name="i_retain_until">nulldate</field>
    <field name="i_partition">0</field>
    <field name="i_is_replica">F</field>
    <field name="i_vstamp">0</field>
  </connectorSpecific>
  <lastModified>2013-12-05T20:13:59Z</lastModified>
  <modifiedBy>Administrator</modifiedBy>
  <dataSize>511425</dataSize>
  <owner>Administrator</owner>
  <createdBy>Administrator</createdBy>
  <repItemType>aspire/dm_document</repItemType>
  <displayUrl>http://10.10.21.73:9080/webtop/objectId=090010e18001d15c</displayUrl>
  <acls>
    <acl access="allow" domain="dctm://10.10.21.73:1489/DocumentumRepository" entity="group" fullname="dctm://10.10.21.73:1489/DocumentumRepository@dm_world" name="dm_world" scope="global"/>
    <acl access="allow" domain="dctm://10.10.21.73:1489/DocumentumRepository" entity="group" fullname="dctm://10.10.21.73:1489/DocumentumRepository@Administrator" name="Administrator" scope="global"/>
    <acl access="allow" domain="dctm://10.10.21.73:1489/DocumentumRepository" entity="group" fullname="dctm://10.10.21.73:1489/DocumentumRepository@docu" name="docu" scope="global"/>
  </acls>
  <sourceName>documentum</sourceName>
  <sourceType>documentum</sourceType>
  <connectorSource>
    <url>dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB</url>
    <username>Administrator</username>
    <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
    <dfcPropsFilePath>config/dfc.properties</dfcPropsFilePath>
    <webtopUrl>http://10.10.21.73:9080/webtop/objectId=</webtopUrl>
    <maxFileSize>Unlimited</maxFileSize>
    <indexContainers>true</indexContainers>
    <scanRecursively>true</scanRecursively>
    <scanSystemCabinets>false</scanSystemCabinets>
    <fileNamePatterns/>
    <docbase>DocumentumRepository</docbase>
    <host>10.10.21.73</host>
    <port>1489</port>
    <displayName>documentum</displayName>
  </connectorSource>
  <action>add</action>
  <hierarchy>
    <item id="246AEB4224DF69E86C83D5AFC357A3FD" level="6" name="dm_document-0024.txt" url="dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB/folder-1/folder-1-3/dm_document-0024.txt">
      <ancestors>
        <ancestor id="6846F1C598D288AC85E2DDE6F178B488" level="5" name="folder-1-3" parent="true" type="aspire/dm_folder" url="dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB/folder-1/folder-1-3/"/>
        <ancestor id="2760A49EEC1E0C469929E66A909BBCAA" level="4" name="folder-1" type="aspire/dm_folder" url="dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB/folder-1/"/>
        <ancestor id="2F224B8B2365BE6BFD1554713BFFF190" level="3" name="5000-500KB" type="aspire/dm_cabinet" url="dctm://10.10.21.73:1489/DocumentumRepository/5000-500KB/"/>
        <ancestor id="A5ECC7C3A8738BB297CC336536AD3B60" level="2" name="DocumentumRepository" type="aspire/docbase" url="dctm://10.10.21.73:1489/DocumentumRepository/"/>
        <ancestor id="93EAEBDC4E5AF3FFF212F8E80902AF01" level="1" name="documentum" type="aspire/documentum" url="dctm://10.10.21.73:1489/"/>
      </ancestors>
    </item>
  </hierarchy>
</doc>