Crawling TeamForge Tutorial (Aspire 2)

This tutorial walks through the steps necessary to crawl a TeamForge repository using the Aspire TeamForge connector.

Before Beginning: Add/Create User Account

A prerequisite for crawling TeamForge is a TeamForge repository account with sufficient permissions to crawl documents for indexing. The account must have access to all the users, groups and roles that relate to the items you want to crawl. You may use an existing user account or create a new one. If you create a new one, the recommended name for the account is "aspire_crawl_account" or something similar.

The username and password for this account will be required below.

Step 1: Set TeamForge Repository Access Rights

The "aspire_crawl_account" will need sufficient access rights to read all of the items in the TeamForge repository that you wish to crawl.

To set the rights for your "aspire_crawl_account", do the following:

Log in to TeamForge as an administrator.
Click on the 'Admin' menu item in the site navigation bar. Then, on the site administration navigation bar, click Users.
Click the drop-down arrow next to Create and click Single User.
Fill in the required information, enter "aspire_crawl_account" as the user name, and check the 'site admin' check box.
Click Create to create the user.
For more information about creating users in TeamForge, please refer to the TeamForge documentation.

You will need this login information later in these procedures, when entering properties for your TeamForge Connector.

Step 2: Install the TeamForge Connector App into Aspire 2

To specify exactly which TeamForge repository to crawl, we need to create a new "Content Source".

Aspire 2.0 Home Page

Launch Aspire (if it's not already running). See: Launching Aspire
Browse to: http://localhost:50505. For details on using the Aspire Content Source Management page, please refer to UI Introduction.
From the Aspire Admin page (http://localhost:50505/aspire/files/home.html), click on the "[Add Source]" button and select the TeamForge connector from the list.

Step 2a: Specify Basic Information

In the "General" tab in the Add New Content Source window, specify basic information for the content source:

Enter a content source name in the "Name" field.
This can be any name that helps you identify the source. It will be displayed in the content source page, in error messages, etc.
Click on the "Active?" checkbox to add a checkmark.
Unchecking the "Active?" option allows you to configure content sources but not have them enabled. This is useful if the source is unavailable.
Click on the "Schedule" drop-down list and select one of the following: Manually, Periodically, Daily, Weekly or Advanced.
This lets the content source be crawled automatically on a set schedule, such as once a day, several times a week, or periodically (every N minutes or hours). For the purposes of this tutorial, you may want to select Manually and then set up a regular crawling schedule later.
After selecting a Schedule type, specify the details, if applicable:
Manually: No additional options.
Periodically: Specify the "Run job every:" options by entering the number of "hours" and "minutes."
Daily: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options.
Weekly: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options, then click on the day checkboxes to specify the days of the week on which to run the crawl.
Advanced: Enter a custom CRON expression (e.g. 0 0 0 ? * *).
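
As a rough illustration of how the example CRON expression reads, the sketch below labels each of its fields. It assumes a six-field, Quartz-style layout with seconds first, which the "?" placeholder suggests but this tutorial does not confirm. The name describe_cron is hypothetical and not part of Aspire.

```python
# Assumed field order: Quartz-style, seconds first (an assumption, not confirmed by the tutorial).
QUARTZ_FIELDS = ("seconds", "minutes", "hours", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict:
    """Split a six-field CRON expression into named fields (hypothetical helper)."""
    parts = expression.split()
    if len(parts) != len(QUARTZ_FIELDS):
        raise ValueError(f"expected {len(QUARTZ_FIELDS)} fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


# The example expression from the Advanced schedule option.
fields = describe_cron("0 0 0 ? * *")
for name, value in fields.items():
    print(f"{name}: {value}")
```

Under that assumption, the example would start a crawl at 00:00:00 every day: "?" leaves the day of month unspecified, and "*" matches every month and every day of the week.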

Step 2b: Specify connector properties

Connector properties

In the "TeamForge URL" field in the 'Content Source Properties' section, enter the TeamForge Community URL to crawl.
The URL must include the respective protocol, http or https.
Enter the Username and Password of the crawl account you created earlier; it needs sufficient access to crawl the TeamForge Community that you specified.
Note: The password will be automatically encrypted by Aspire.
Check the "Index containers" checkbox so that the crawl accesses all folder type items.
Include/Exclude patterns:
To specify include patterns, click on the 'add new' button for include patterns and enter the regex pattern. Aspire will then crawl only URLs that match the specified pattern. To specify exclude patterns, click on the 'add new' button for exclude patterns and enter the regex pattern. Aspire will then skip URLs that match the specified pattern.
Under advanced configuration, specify a value for 'Session timeout' if you want to change the default of 10 minutes. This value is used as the session timeout when downloading attachments from the TeamForge repository.
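
The effect of include and exclude patterns can be sketched as a simple URL filter. This is only an illustration, not the connector's implementation: it assumes that a pattern matches when it is found anywhere in the URL and that exclude patterns take precedence over include patterns. The function name should_crawl, the URLs and the patterns are all hypothetical.

```python
import re


def should_crawl(url, include_patterns, exclude_patterns):
    """Return True if the URL passes the include and exclude filters."""
    # Assumption: a pattern matches if it is found anywhere in the URL.
    # With no include patterns, every URL is a candidate.
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    # Assumption: a matching exclude pattern always wins.
    return not any(re.search(p, url) for p in exclude_patterns)


# Hypothetical patterns and URLs, for illustration only.
includes = [r"/projects/docs/"]
excludes = [r"\.zip$"]
urls = [
    "https://teamforge.example.com/projects/docs/guide.html",
    "https://teamforge.example.com/projects/docs/archive.zip",
    "https://teamforge.example.com/projects/other/readme.html",
]
for url in urls:
    print(url, should_crawl(url, includes, excludes))
```

Testing patterns against a few sample URLs in this way before saving the content source can help catch a regex that is broader or narrower than intended.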

Step 2c: Specify Workflow Information

Aspire 2.0 Workflow Information

In the "Workflow" tab, specify the workflow steps for the jobs that come out of the crawl.

Drag and drop rules to determine which steps an item should follow after being crawled. These rules could specify where to publish the document or which transformations are needed on the data before sending it to a search engine. See Workflow for more information.

After completing these steps, click on the Save button and you'll be sent back to the Home Page.

Step 3: Initiate the Full Crawl

Now that everything is set up, actually initiating the crawl is easy.

From the Aspire Admin page (http://localhost:50505/aspire/files/home.html), you can see the newly created TeamForge connector.
Make sure the connector is active and the crawl type is Full (if it shows Incremental or Test, click on the crawl type link until it displays Full).
Content sources have the crawl type options Full, Incremental and Test.
Click on the Start button. A pop-up asks you to confirm the removal of already-indexed data; click on the OK button to accept. Initiating a Full Crawl will remove any previously-indexed information (incremental snapshots) and re-index the source from scratch.
The connector will now crawl the whole TeamForge repository. It may take a minute or two for Aspire to connect to the content source and begin feeding the content.

Note that the connector will be started automatically by the scheduler based on the schedule you specified for it, be it once a day, once a week, every hour, etc. You can also start a crawl at any time by clicking on the "Start" button.

Aspire 2.0 running content source

During the Crawl

During the crawl, you can do the following:

Click on the "Refresh" button on the Content Sources page to view the latest status of the crawl.
The status will show RUNNING while the crawl is going, and CRAWLED when it is finished.
Click on the "Complete" link on your connector to view the number of documents crawled so far, the number of documents submitted, and the number of documents with errors.

Aspire 2.0 content source statistics

If there are errors, you will get a clickable "Error" flag that will take you to a detailed error message page.

Step 4: Initiate an Incremental Crawl

If you only want to process content updates from TeamForge (documents which are added, modified, or removed), click on the "Run" button while the crawl type is set to "Incremental". The TeamForge connector will automatically identify only the changes which have occurred since the last crawl.

If this is the first time that the connector has crawled, an "Incremental" crawl does the same thing as a "Full" crawl: both crawl the entire content source and submit all documents. Thereafter, clicking on the "Run" button while the crawl type is "Incremental" will crawl only updates.

Scheduled crawls are always "Incremental" crawls. This means that the first scheduled job will perform a "Full" crawl, and jobs after that will perform "update" crawls. Statistics are reset for every crawl.
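
The idea behind an incremental crawl can be sketched as a comparison between the snapshot saved by the previous crawl and the current state of the repository. The code below is a conceptual illustration only and does not reflect how the connector stores or compares its snapshots. The function diff_snapshots, the item ids and the signature values are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two {item_id: signature} snapshots and classify the changes."""
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    modified = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"added": added, "modified": modified, "removed": removed}


# Hypothetical item ids and signatures, for illustration only.
previous = {"doc1001": "v1", "doc1002": "v1", "doc1003": "v1"}
current = {"doc1001": "v1", "doc1002": "v2", "doc1004": "v1"}

# Only the differences since the last crawl are reported.
print(diff_snapshots(previous, current))

# With no previous snapshot, every item counts as added,
# which is why a first incremental crawl behaves like a full crawl.
print(diff_snapshots({}, current))
```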

Group Expansion

Group expansion is configured in the "Advanced Connector Properties" section of the Connector tab.

Click on the Advanced Configuration checkbox to enable the advanced properties section.
Scroll down to Group Expansion and click the checkbox.
Set a schedule for group expansion refresh and cleanup.
Optionally, click on the "Use external Group Expansion" checkbox to select an LDAP Cache component for LDAP group expansion. See more info on LDAP Cache.
Set the TeamForge URL, user name and password of the TeamForge repository.

Limitations

The TeamForge connector does not support repository commits. Users can integrate repositories (SVN, CVS, Git, etc.) into TeamForge and view commits and other information. Because these repositories are external to TeamForge and have different URL patterns, repository commits are not supported.
The connector also does not support default access permissions. For more information regarding default access permissions, please refer to the TeamForge documentation.
