This tutorial walks through the steps necessary to crawl Confluence using the Confluence connector.

Before Beginning: Create a Confluence account

A prerequisite for crawling Confluence is to have an account with admin privileges on the Confluence server you want to crawl.

The domain, username and password will be required later on to configure the connector.

The username should be self-explanatory; something like "aspire_crawl_account" is recommended.

See Atlassian Confluence Prerequisites section for more details.

Step 1: Launch Aspire and open the Content Source Management Page



Aspire Main Admin Page

Launch Aspire (if it's not already running).

Browse to: http://localhost:50505. For details on using the Aspire Content Source Management page, please refer to UI Introduction.

Step 2: Add a new Atlassian Confluence Content Source



Add new source

To specify the Confluence site you want to crawl, you need to create a new "Content Source". To create a new content source:

  1. From the Aspire Home page, click on the "Add Source" button.
  2. Click on "Confluence Connector".


Step 2a: Specify Basic Information



General Configuration Tab

In the "General" tab in the Add New Content Source window, specify basic information for the content source:

  1. Enter a content source name in the "Name" field.

    This can be any name that helps you identify the source. It will be displayed in the content source page, in error messages, etc.

  2. Click on the "Active?" checkbox to add a checkmark.

    Unchecking the "Active?" option allows you to configure content sources without enabling them. This is useful if the Confluence site will be under maintenance and no crawls are wanted during that period.

  3. Click on the "Schedule" drop-down list and select one of the following: Manually, Periodically, Daily, or Weekly.

    Aspire can automatically schedule content sources to be crawled on a set schedule, such as once a day, several times a week, or periodically (every N minutes or hours). For the purposes of this tutorial, you may want to select Manually and then set up a regular crawling schedule later.

  4. After selecting a Schedule type, specify the details, if applicable:
    1. Manually: No additional options.
    2. Periodically: Specify the "Run every:" options by entering the number of "hours" and "minutes."
    3. Daily: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options.
    4. Weekly: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options, then clicking on the day checkboxes to specify days of the week to run the crawl.
    5. Advanced: Enter a custom CRON expression (e.g. 0 0 0 ? * *).
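
The example CRON expression above looks like the Quartz-style layout (seconds first, "?" for an unspecified day field). That is an assumption based on the example, not something this tutorial states. Under that assumption, a minimal Python sketch that labels the fields of an expression can help when sanity-checking a custom schedule before saving it. The function name label_cron_fields is hypothetical.

```python
# Field order ASSUMED to be Quartz-style: seconds come first, the year is optional.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week", "year"]


def label_cron_fields(expression: str) -> dict:
    """Split a cron expression on whitespace and pair each part with its field name."""
    parts = expression.split()
    if not 6 <= len(parts) <= 7:
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}: {expression!r}")
    # zip stops at the shorter sequence, so "year" is omitted for 6-field expressions.
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    # The example from this tutorial.
    for name, value in label_cron_fields("0 0 0 ? * *").items():
        print(f"{name:>12}: {value}")
```

Read this way, the tutorial's example sets seconds, minutes, and hours to 0 and leaves the date fields open, which would correspond to a crawl at midnight every day.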

Step 2b: Specify the Connector Information



Connector Configuration Tab

In the "Connector" tab, specify the connection information to crawl the Confluence site.

  1. Enter the URL of the Confluence site. This is the server URL where Confluence is installed. It should include the protocol (http or https), the server name or IP address, and the port number if needed (e.g. http://myConfluenceSite:8090/). If you are crawling over HTTPS, see Crawling Confluence over HTTPS.
  2. Enter the credentials. The account needs sufficient access to crawl the whole site.

    Domain, Username, Password

  3. Check the other options as needed:
    1. Use Confluence plugin?: see Confluence ACL Plugin. If unchecked, only files will be indexed.
    2. Include attachments?: Include attachments of Pages and Blogs in the crawl.
    3. Include comments?: Include comments in the content of Pages and Blogs.
    4. Anonymous access allowed?: If anonymous (or public) access is allowed on your Confluence instance, check the "Anonymous access allowed" checkbox. To see if anonymous access is allowed, please see access.
    5. Include/Exclude patterns: Enter regex patterns to include or exclude files/folders based on URL matches.
  4. Select the Confluence version:
    1. 3.5
    2. 4.X
    3. 5.X
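
Two of the settings above lend themselves to a quick check before saving: the server URL (protocol, server name, optional port) and the include/exclude regex patterns. The Python sketch below is only an illustration under stated assumptions. The function names are hypothetical, and the pattern semantics (an exclude match always wins, an empty include list admits every URL) are assumed, not taken from the connector's documentation.

```python
import re
from urllib.parse import urlsplit


def check_server_url(url: str) -> list:
    """Return a list of problems found in a Confluence server URL (empty if none)."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append("URL must start with http:// or https://")
    if not parts.hostname:
        problems.append("URL must include a server name or IP address")
    try:
        parts.port  # accessing .port raises ValueError when the port is invalid
    except ValueError:
        problems.append("port must be a number between 0 and 65535")
    return problems


def is_crawlable(url: str, include: list, exclude: list) -> bool:
    """ASSUMED semantics: exclude wins; an empty include list admits everything."""
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return not include or any(re.search(pattern, url) for pattern in include)


if __name__ == "__main__":
    print(check_server_url("http://myConfluenceSite:8090/"))
    print(is_crawlable("http://myConfluenceSite:8090/display/DOCS/Home",
                       include=[r"/display/DOCS/"], exclude=[r"\.tmp$"]))
```

Testing patterns against a handful of known page URLs in this way can catch an overly broad exclude before a long crawl is started.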

Step 2c: Specify Workflow Information



Workflow Configuration Tab

In the "Workflow" tab, specify the workflow steps for the jobs that come out of the crawl.

  1. Drag and drop rules to determine which steps an item should follow after being crawled. These rules could specify where to publish the document or which transformations are needed on the data before sending it to a search engine.

After completing these steps, click on the Save button and you'll be sent back to the Home Page.

Step 3: Initiate the Full Crawl



Start Crawl

Now that the content source is set up, the crawl can be initiated.

  1. Click on the crawl type option to set it to "Full". It is set to "Incremental" by default, and the first incremental crawl behaves like a full crawl. After the first crawl, set it to "Incremental" to pick up any changes made in the repository.
  2. Click on the Start button.
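
The difference between a full and an incremental crawl can be pictured as a comparison of snapshots. The Python sketch below is a generic illustration, not Aspire's actual implementation. It assumes each crawl records one signature per item (for example a version stamp or content hash), and the function name diff_snapshots is hypothetical.

```python
def diff_snapshots(previous: dict, current: dict) -> dict:
    """Compare two {item_id: signature} snapshots and classify the changes."""
    added = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(k for k in current if k in previous and current[k] != previous[k])
    return {"add": added, "update": updated, "delete": deleted}


if __name__ == "__main__":
    # Hypothetical item ids and signatures, for illustration only.
    before_first_crawl = {}
    site_now = {"page-1": "v1", "page-2": "v1"}
    print(diff_snapshots(before_first_crawl, site_now))

    site_later = {"page-1": "v2", "page-3": "v1"}
    print(diff_snapshots(site_now, site_later))
```

With an empty previous snapshot every item is classified as an addition, which is consistent with the first incremental crawl behaving like a full crawl.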

During the Crawl



Crawl Statistics

During the crawl, you can do the following:

  • Click on the "Refresh" button on the Content Sources page to view the latest status of the crawl.

    The status will show RUNNING while the crawl is going, and CRAWLED when it is finished.

  • Click on "Complete" to view the number of documents crawled so far, the number of documents submitted, and the number of documents with errors.

If there are errors, you will get a clickable "Error" flag that will take you to a detailed error message page.

Advanced Properties (optional)

This section shows how to configure the advanced properties of the connector.

  • Groups prefix separator. Prefix used to separate users and groups in the ACLs.
  • Public ACL indicator. ACL used for public content in Confluence.



Advanced Properties

Group Expansion

Group expansion is configured in the "Advanced Connector Properties" section of the Connector tab.

  1. Click on the Advanced Configuration checkbox to enable the advanced properties section.
  2. Scroll down to Use Group Expansion and click the checkbox.
  3. Set a schedule for group expansion refresh and cleanup.
  4. Optionally, click on the "Use external Group Expansion" checkbox to select an LDAP Cache component for LDAP group expansion. For more information, see LDAP Cache.
  5. Set the Confluence URL for which you want to expand groups.
  6. Set the username of the crawl account.
  7. Set the password of the crawl account.
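
Conceptually, group expansion resolves the full set of groups a user belongs to, possibly including nested groups when an external directory such as LDAP is involved. The Python sketch below illustrates the idea on hypothetical in-memory data. The names expand_groups, direct_groups, and parent_groups are assumptions made for illustration and do not reflect the connector's internals.

```python
def expand_groups(user: str, direct_groups: dict, parent_groups: dict) -> set:
    """Return every group the user belongs to, following nested memberships."""
    seen = set()
    pending = list(direct_groups.get(user, []))
    while pending:
        group = pending.pop()
        if group in seen:
            continue  # already visited; also guards against cycles in the group graph
        seen.add(group)
        pending.extend(parent_groups.get(group, []))
    return seen


if __name__ == "__main__":
    # Hypothetical users and groups, for illustration only.
    direct = {"alice": ["confluence-users"]}
    parents = {"confluence-users": ["staff"], "staff": ["everyone"]}
    print(sorted(expand_groups("alice", direct, parents)))
```

Because memberships change over time, a cached result of this kind goes stale, which is why the steps above ask for a refresh and cleanup schedule.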

SSO Authentication (SiteMinder SSO SMSESSION cookie)

  1. Click on the Use SSO authentication checkbox.
  2. Set the SSO server URL.
  3. Set the cookie name.
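
With SiteMinder SSO, the authenticated session is carried in a cookie (SMSESSION, per this section's title). The Python sketch below only shows how such a cookie could be attached to an outgoing request using the standard library. It does not perform a login, the URL and cookie value are placeholders, and build_sso_request is a hypothetical helper, not part of Aspire.

```python
from urllib.request import Request


def build_sso_request(url: str, cookie_name: str, cookie_value: str) -> Request:
    """Create a request that carries the SSO session cookie in its Cookie header."""
    request = Request(url)
    request.add_header("Cookie", f"{cookie_name}={cookie_value}")
    return request


if __name__ == "__main__":
    # Placeholder values; a real cookie value would come from the SSO server.
    req = build_sso_request("http://myConfluenceSite:8090/", "SMSESSION", "placeholder-token")
    print(req.get_header("Cookie"))
```

The cookie name is a parameter here because the configuration above lets you set it, so a deployment that uses a different name needs no code change.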
