Features

The SharePoint connector will crawl content from any SharePoint site collection URL that you specify. The connector will retrieve Sites, Lists, Folders, and List Items, as well as other pages (in .aspx format).

The connector uses web services to read content from SharePoint directly; it does not perform web crawling. Features of the SharePoint connector include:

  • Performs incremental crawling (so that only new/updated documents are indexed)
  • Fetches access control lists (ACLs) for document level security
  • Is search engine independent
  • Runs from any machine with access to the given SharePoint URLs
  • Supports NTLM and HTTPS
  • Supports site discovery
  • Supports Claims Users and Groups
  • Is designed to support early binding mechanisms
  • Can optionally run without installing anything on SharePoint (with important limitations)
  • Supports regular expression patterns for files to include or exclude
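
As an illustration of the include/exclude patterns mentioned in the last feature, here is a minimal Python sketch of how such a filter might behave. It is not the connector's implementation: the function names, the example patterns, and the precedence rules (an exclude match wins, and an empty include list accepts everything) are all assumptions made for this example.

```python
import re


def make_filter(include_patterns, exclude_patterns):
    """Return a predicate that decides whether a URL should be crawled."""
    includes = [re.compile(p) for p in include_patterns]
    excludes = [re.compile(p) for p in exclude_patterns]

    def should_crawl(url):
        # Assumption: an exclude match always wins over an include match.
        if any(rx.search(url) for rx in excludes):
            return False
        # Assumption: an empty include list means "include everything".
        return not includes or any(rx.search(url) for rx in includes)

    return should_crawl


# Hypothetical patterns: keep common office documents, skip form templates.
should_crawl = make_filter(
    include_patterns=[r"\.(pdf|docx?|pptx?)$"],
    exclude_patterns=[r"/Forms/", r"\.tmp$"],
)
```

With these hypothetical patterns, a URL ending in report.pdf is accepted, while anything under a /Forms/ path is rejected even if its extension matches an include pattern.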

Future Development Plan

The following features are not currently implemented, but are on the development plan:

  • Automatic metadata propagation from site "about" pages to all of the files within the site
  • Indexing of SharePoint list item attachments
  • Indexing of, and support for, people search

Anything we should add? Please let us know.

SharePoint Architecture

Detailed information is available in the MSDN article.

Summary of SharePoint organization

This is the hierarchy of processes, applications, sites, sub-sites, libraries, folders, and documents within SharePoint 2010.

  • SharePoint Server
    • SharePoint Web Application Pool
      • SharePoint Web Application (single web application)
        • Main Site Collection (the primary or main site created for the web application, associated with the primary http://xyz.server.com URL)
          • Sub Sites
            • Document Libraries
              • Folders
                • Documents
                  • Attachments
        • Other Site Collections
          • Sub Sites
            • Document Libraries
              • Folders
                • Documents
                  • Attachments

Content Retrieved by the Connector

The SharePoint connector will retrieve the following objects:

  • Sites
  • Lists
  • Folders
  • Documents or List Items
  • Attachments

List Items can take a number of different forms, for example documents (PDF, DOC, PPT, etc.), calendar events, or announcements. For more information on how List Item content types work, see the MSDN article.

Operation Mode

The connector uses SOAP web services over HTTP or HTTPS to acquire information about SharePoint content. The web services it consumes are either the standard web services provided by SharePoint or the Search Technologies web services extension. The latter is a set of web services, optionally deployed on the SharePoint server, that enables additional connector features.
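
To make the SOAP-over-HTTP exchange concrete, the following Python sketch builds (but does not send) a request to one of SharePoint's standard web services, the Lists service and its GetListCollection operation. Which operations the connector actually calls is not stated here, so that choice is an assumption, as are the helper name build_soap_request and the xyz.server.com host. Authentication (for example NTLM) is left out.

```python
import urllib.request

# Namespace used by SharePoint's standard SOAP web services.
SOAP_NS = "http://schemas.microsoft.com/sharepoint/soap/"


def build_soap_request(site_url, service, operation):
    """Build a parameterless SOAP 1.1 POST request (hypothetical helper)."""
    envelope = (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
        "<soap:Body>"
        f'<{operation} xmlns="{SOAP_NS}" />'
        "</soap:Body>"
        "</soap:Envelope>"
    )
    return urllib.request.Request(
        # Standard web services are exposed under the site's _vti_bin path.
        url=f"{site_url.rstrip('/')}/_vti_bin/{service}.asmx",
        data=envelope.encode("utf-8"),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": f'"{SOAP_NS}{operation}"',
        },
        method="POST",
    )


# Assumed example: ask a site for the lists it contains.
request = build_soap_request("http://xyz.server.com", "Lists", "GetListCollection")
```

Passing the resulting object to urllib.request.urlopen would send it; a real deployment would also need an authentication handler that matches the server's configuration.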

The standard web services provided by SharePoint grant access to information on sites, lists, folders, and documents. However, they have certain limitations:

  • They provide no information about web applications or site collections (no site discovery).
  • They don't provide item-level security information, i.e., permissions assigned to folders or list items (such as documents).
  • They offer no way to find out the managed paths of a site collection.

The Search Technologies web services extension can be installed on SharePoint servers to overcome these limitations of the standard web services.

The connector acquires content as follows:

  • It recursively traverses all sites, subsites, lists, folders, and documents, and creates a sub-job for each object discovered. Each sub-job contains all available metadata, including ACLs.
  • It saves a snapshot file so that item states can be compared with those of the previous crawl, which allows incremental crawls to detect added, updated, and deleted items.
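
The two steps above can be sketched in Python as a depth-first walk that emits one sub-job per object, followed by a comparison of two snapshots. This is only an illustration under stated assumptions: the nested-dictionary input, the field names, the use of a SHA-256 signature over metadata and ACLs, and the in-memory snapshot format are all hypothetical and do not describe the connector's actual snapshot file.

```python
import hashlib
import json


def walk(node, parent_url=""):
    """Yield one sub-job per object (site, list, folder, item), depth first."""
    url = f"{parent_url}/{node['name']}"
    yield {
        "url": url,
        "type": node["type"],
        "metadata": node.get("metadata", {}),
        "acls": node.get("acls", []),
    }
    for child in node.get("children", []):
        yield from walk(child, url)


def signature(job):
    # Assumption: a change to metadata or ACLs counts as an update.
    payload = json.dumps([job["metadata"], job["acls"]], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def take_snapshot(root):
    """Map each object's URL to a signature of its current state."""
    return {job["url"]: signature(job) for job in walk(root)}


def diff_snapshots(previous, current):
    """Classify URLs as added, updated, or deleted between two snapshots."""
    shared = current.keys() & previous.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "updated": sorted(u for u in shared if current[u] != previous[u]),
        "deleted": sorted(previous.keys() - current.keys()),
    }


def make_site(items):
    # Hypothetical content tree: one site holding one document library.
    docs = [{"name": n, "type": "item", "metadata": {"modified": m}}
            for n, m in items]
    return {"name": "site", "type": "site",
            "children": [{"name": "Docs", "type": "list", "children": docs}]}


first_crawl = take_snapshot(make_site([("a.pdf", "2014-01-01"), ("b.doc", "2014-01-01")]))
second_crawl = take_snapshot(make_site([("a.pdf", "2014-02-01"), ("c.ppt", "2014-02-01")]))
changes = diff_snapshots(first_crawl, second_crawl)
```

In this example, a.pdf is reported as updated, c.ppt as added, and b.doc as deleted, so an incremental crawl would only need to reprocess those three items.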

