Full and Incremental Crawls (Aspire 2)

High Level Workflow of Connectors

The basic workflow of a repository connector is to connect to the repository, obtain a list of files and their metadata (either directly or by crawling the repository for that information), and then fetch the required data. This process is performed for both full and incremental crawls. Use the term “crawl” with care, because its meaning depends on where you are in the process.

Repository Scan

A repository scan is a process in which a scanner application walks the repository content structure and records the current change status of each item in an organized, repeatable and efficient way. During a scan, one of four actions is recorded for every item in the repository: added, updated, deleted or not changed. How a scan is performed depends on the tools the repository provides for gathering information about its items.

Repository Crawl

A repository crawl is a process in which an application starts from a given item and discovers new items through the links and references within the items themselves. This is an unstructured and unpredictable process: items can be discovered at different times, because an item’s content can change to add or remove references to other items. Crawls are most commonly used to gather items from web sites.

Hierarchical Scan

A hierarchical scan processes items by following the hierarchical structure of the repository, in which two types of high-level items can be identified: containers and items. A container is a special type of item which, as its name suggests, contains other items. The non-container items are typically documents and files. A hierarchical scan starts at the top-level container and keeps discovering containers within it, and within any subsequent containers, until all containers and items have been found. The structure of these repositories is a tree, so a breadth-first or depth-first tree traversal can be used to implement the discovery process of these scanner applications.
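The discovery process described above can be sketched as a breadth-first traversal. This is a minimal illustration, not Aspire code: the in-memory `repository` dictionary and the `hierarchical_scan` function are hypothetical names, and a nested dictionary stands in for a container.

```python
from collections import deque

# Hypothetical in-memory repository: a dict is a container, anything else is an item.
repository = {
    "docs": {
        "report.pdf": "item",
        "archive": {"old.txt": "item"},
    },
    "readme.txt": "item",
}


def hierarchical_scan(root):
    """Breadth-first discovery of every container and item path under root."""
    containers, items = [], []
    queue = deque([("", root)])  # (path, container) pairs waiting to be expanded
    while queue:
        path, container = queue.popleft()
        for name, child in container.items():
            child_path = f"{path}/{name}"
            if isinstance(child, dict):
                # A container: record it and schedule it for expansion.
                containers.append(child_path)
                queue.append((child_path, child))
            else:
                # A non-container item (document or file).
                items.append(child_path)
    return containers, items


containers, items = hierarchical_scan(repository)
```

Swapping `queue.popleft()` for `queue.pop()` would turn the same loop into a depth-first traversal; either order finds every container and item.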

Linear Scan

A linear scan processes items in a flat repository structure, usually a table-driven repository. It processes items based on a query against the table structure, in which a sort order and a modification indicator give the scanner application the information it needs to keep the state of each item in the repository.
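As a rough sketch of this idea, the example below queries a table ordered by a modification indicator and remembers the highest value seen. The table name `items`, its columns, and the `linear_scan` function are assumptions made for illustration; an in-memory SQLite database stands in for the repository.

```python
import sqlite3

# Hypothetical flat repository: one table with an integer modification indicator.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, modified INTEGER)")
conn.executemany(
    "INSERT INTO items (id, title, modified) VALUES (?, ?, ?)",
    [(1, "first", 100), (2, "second", 250), (3, "third", 175)],
)


def linear_scan(connection, since=0):
    """Return rows modified after `since`, oldest change first, and the new high-water mark."""
    rows = connection.execute(
        "SELECT id, title, modified FROM items WHERE modified > ? ORDER BY modified, id",
        (since,),
    ).fetchall()
    # The last row carries the largest modification value because of the ORDER BY.
    high_water = rows[-1][2] if rows else since
    return rows, high_water


all_rows, mark = linear_scan(conn)        # first pass: every row qualifies
conn.execute("UPDATE items SET modified = 300 WHERE id = 1")
changed, mark2 = linear_scan(conn, mark)  # later pass: only rows changed after the mark
```

Sorting by the modification indicator lets the scanner keep a single value as its recorded state, at the cost of not seeing deleted rows.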

API Based Scan

Certain repositories provide specific APIs that make it easier to build a scanner. These APIs provide methods to retrieve the state of every item in the repository with as few calls as possible. The scanner implementation relies on the functionality provided by the API to make the scan process as efficient as possible while keeping track of all items and their actions (add, update and delete). Although support for these APIs is increasing in new repositories and in new versions of existing ones, it is still very common for a repository to have no such API, or for the API to cover only part of the required information. This is most apparent for ACL and group information.

Full Scan

A full scan happens when a repository is scanned for the very first time or when large changes have occurred within the repository. It involves gathering the information of every single item in the repository to create a point-in-time state of the repository and to build a list of content to fetch.

Incremental Scan

An incremental scan happens when a repository is scanned to find the latest changes relative to a previously recorded state of the repository. For an incremental scan to happen, a previous scan state is required.
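The comparison against a previously recorded state can be sketched as follows. This is an illustration under assumptions, not the Aspire implementation: each state is modeled as a dictionary from item id to a signature (for example a modification date or content hash), and `compare_states` is a hypothetical name.

```python
def compare_states(previous, current):
    """Classify every item given two {item_id: signature} snapshots."""
    actions = {}
    for item_id, signature in current.items():
        if item_id not in previous:
            actions[item_id] = "added"
        elif previous[item_id] != signature:
            actions[item_id] = "updated"
        else:
            actions[item_id] = "not changed"
    # Anything recorded before but missing now has been deleted.
    for item_id in previous:
        if item_id not in current:
            actions[item_id] = "deleted"
    return actions


previous = {"a.txt": "v1", "b.txt": "v1", "c.txt": "v1"}
current = {"a.txt": "v1", "b.txt": "v2", "d.txt": "v1"}
actions = compare_states(previous, current)
```

Without the `previous` snapshot every item would be classified as added, which is why a prior scan state is a precondition.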

Incremental scan implementations will vary depending on the type of repository scanned:

Hierarchical scanners: the incremental scan process requires all items to be scanned and compared with their previous state to identify any add, update or delete actions that happened within the containers.

Linear scanners: the incremental scan process will rely on query filters to retrieve items that have changed since the last recorded state of the repository.
- Important: When deletes cannot be identified with a query filter, the list of all current items is compared with the list of all items from the last recorded state, and a delete action is triggered for every item missing from the current list.

API based scanners: the scanner relies on the information the repository provides about changes that have happened in it, that is, the add, update or delete actions since the last recorded scan. In some cases a plug-in must be installed on the repository server to provide some of the required information, often the ACLs and groups needed to support document level security.

Hybrid scanners: the scanner saves a change token when a scan completes and uses this token in the next incremental scan to ask the repository only for the documents that have changed since the last scan. With this approach, the full scan is performed as a linear or hierarchical scan, and the incremental scan is based on an API call to the repository.
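The change-token approach can be sketched as below. Everything here is hypothetical: `FakeRepository`, `current_token` and `changes_since` are illustrative stand-ins for whatever change API a real repository exposes, and the token is simply a position in an in-memory change log.

```python
class FakeRepository:
    """Hypothetical stand-in for a repository that exposes a change log."""

    def __init__(self):
        self._log = []  # ordered (action, item_id) entries

    def record(self, action, item_id):
        self._log.append((action, item_id))

    def current_token(self):
        """Token marking the present position in the change log."""
        return len(self._log)

    def changes_since(self, token):
        """Return the changes recorded after `token`, plus a fresh token."""
        return self._log[token:], len(self._log)


repo = FakeRepository()
repo.record("add", "doc-1")
repo.record("add", "doc-2")

# The full scan itself would be a linear or hierarchical scan; only the token is saved here.
saved_token = repo.current_token()

repo.record("update", "doc-1")
repo.record("delete", "doc-2")

# Incremental scan: ask only for what changed after the saved token.
changes, saved_token = repo.changes_since(saved_token)

# A second incremental scan with no new activity returns nothing.
no_changes, saved_token = repo.changes_since(saved_token)
```

Storing the new token returned with each batch keeps consecutive incremental scans from reprocessing the same changes.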

What does each Aspire connector do?

Aspire connectors implement different scanner types to retrieve information from repositories, depending mainly on the functionality each type of repository provides.

Aspire Hierarchical Scanners

CIFS
Confluence
Documentum
eRoom
File System
FTP
Home Page
Lotus Notes
Rightnow
S3
Subversion
Teamforge

Aspire Linear Scanners

Jira
Jive (uses plug-in for security information)
RDBMS with snapshots
Socialcast

API based Scanners

Heritrix
IBM
RDBMS with tables
RSS
Salesforce
SP2010 (uses plug-in for security information)
SP2013
