eRoom Crawling Tutorial (Aspire 2)

Before Beginning: Create User Account

A prerequisite for crawling eRoom is to have an eRoom account with sufficient permissions. The recommended name for this account is "aspire_crawl_account" or something similar.

The username and password for this account will be required below.

Configuration Steps

Set eRoom Access Rights

The "aspire_crawl_account" will need sufficient access rights to read all of the documents in eRoom that you wish to crawl.

To set the rights for your "aspire_crawl_account", do the following:

Log into the eRoom Server as an Administrator.
Click on Site Settings.
Click on General.
Click on the Members icon in the "Site Administrator" section.
Click on "aspire_crawl_account" or click on add to create a new one.
Make the role of the "aspire_crawl_account" either site administrator or community administrator (so that it has access to all eRoom content).

You will need this login information later in these procedures, when entering properties for your eRoom Connector.

Launch Aspire

Aspire Main Admin Page

Launch Aspire (if it's not already running). See:

Launching Aspire

For details on using the Aspire System Administration page, please refer to Management.

Note: Above your server are elements related to Distributed Communications. By default, they are disabled. To enable these features, please refer to Settings Configuration for information.

Install a new eRoom Content Source

Add new source

To specify exactly what eRoom server and site to crawl, we will need to create a new "Content Source".

To create a new content source:

From the Aspire 2 Home page, click on the "Add Source" button.
Click on "eRoom Connector".

Specify Basic Information

General Configuration Tab

In the "General" tab in the Add New Content Source window, specify basic information for the content source:

Enter a content source name in the "Name" field.
This is any useful name which you decide is a good name for the source. It will be displayed in the content source page, in error messages, etc.
Click on the "Active" checkbox to add a checkmark.
Unchecking the "Active" option allows you to configure content sources without enabling them. This is useful if an eRoom site is down, being maintained, etc.
Click on the "Schedule" drop-down list and select one of the following: Manually, Periodically, Daily, Weekly, or Advanced.
The Content Source Manager (CS Manager) can automatically schedule content sources to be crawled on a set schedule, such as once a day, several times a week, or periodically (every N minutes or hours). For the purposes of this tutorial, you may want to select Manually and then set up a regular crawling schedule later.
After selecting a Schedule type, specify the details, if applicable:
1. Manually: No additional options.
2. Periodically: Specify the "Run every:" options by entering the number of "hours" and "minutes."
3. Daily: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options.
4. Weekly: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options, then clicking on the day checkboxes to specify days of the week to run the crawl.
5. Advanced: Enter a custom CRON expression (e.g. 0 0 0 ? * *).
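
The example expression above has six fields and uses "?", which suggests a Quartz-style CRON expression (seconds first, with an optional seventh field for the year). That format is an assumption here, not something this page states. Under that assumption, the following minimal Python sketch labels each field of an expression so you can check what a custom schedule means before saving it. The function name label_cron_fields is hypothetical and is not part of Aspire.

```python
# Field order of a Quartz-style cron expression (assumed format).
# The seventh field (year) is optional.
QUARTZ_FIELDS = [
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
    "year",
]


def label_cron_fields(expression):
    """Return a dict mapping each Quartz field name to its value."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(
            "expected 6 or 7 space-separated fields, got %d" % len(parts)
        )
    # zip stops at the shorter sequence, so "year" is omitted for 6 fields.
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    for name, value in label_cron_fields("0 0 0 ? * *").items():
        print(name, "=", value)
```

Read this way, the example 0 0 0 ? * * has seconds, minutes, and hours all set to 0, with no restriction on the day or month, which corresponds to a crawl at midnight every day.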

Specify the Connector Information

Connector Configuration Tab

In the "Connector" tab, specify the connection information to crawl eRoom.

Enter the eRoom URL you want to crawl (see eRoom URLs).
Enter the account info for the crawl user (username and password).
Set the other options as needed:
1. Include/Exclude patterns: Enter regex patterns to include or exclude items.
2. Scan Recursively: Scan through each container's child nodes.
3. Index Containers: Index sites, lists and folders. If unchecked, only list items will be indexed.
4. Use Group Expansion: Specify the eRoom URL, username, password and the schedule for the group expansion process.
5. Use SSL: Specify the path of the keystore file and its password when crawling an HTTPS URL.

Connector Advanced Properties

eRoom URLs

An eRoom URL is needed to tell the connector what to crawl. This should be the URL of a single eRoom on the eRoom server.

It must have the following format: http://server/eRoom/facility_name/eroom_name/

Examples:

http://win-cca7ctfki2n/eRoom/Test_Facility/TEST_EROOM/
will crawl the complete eRoom named TEST_EROOM
http://kmwermtst05.corp.emc.com/eRoom/emc1/eRoomGuide/
will crawl the complete eRoom named eRoomGuide

Note:
     The URL of an individual item, such as:
     http://win-cca7ctfki2n/eRoom/Test_Facility/TEST_EROOM/0_99b/
     
     does not work as a crawl URL. To generate the ACLs of an item, the connector needs the 
     complete ACL hierarchy of the item's parents, and there is no way to get the parent 
     of an item.
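
The format rule above can be checked before you paste a URL into the connector. The following is a minimal Python sketch, not part of Aspire: the function name is_eroom_crawl_url is hypothetical, and the check assumes that the "eRoom" path segment is spelled exactly as in the examples and that both http and https are acceptable schemes.

```python
from urllib.parse import urlparse


def is_eroom_crawl_url(url):
    """Return True if url looks like http://server/eRoom/facility_name/eroom_name/."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Split the path into non-empty segments, so a trailing slash is optional.
    segments = [segment for segment in parsed.path.split("/") if segment]
    # Exactly three segments are expected: eRoom, facility name, eRoom name.
    # A fourth segment means the URL points at an item inside the eRoom.
    return len(segments) == 3 and segments[0] == "eRoom"


if __name__ == "__main__":
    print(is_eroom_crawl_url("http://win-cca7ctfki2n/eRoom/Test_Facility/TEST_EROOM/"))
    print(is_eroom_crawl_url("http://win-cca7ctfki2n/eRoom/Test_Facility/TEST_EROOM/0_99b/"))
```

The first call accepts the eRoom URL from the examples, and the second rejects the item URL from the note because it has an extra path segment.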

Define Patterns (Optional)

Include or Exclude Patterns

You can use Java regular expressions to include or exclude specific items from the index. These patterns are optional. If you enter include or exclude patterns, each URL is compared to the patterns and is indexed or not indexed, as specified. You can enter multiple patterns to include or exclude.

To add a new expression, click on an "Add New" link, then enter the expression in the text field. For example, to exclude a database named ROOT_DB and all of its children, you would enter: ".*ROOT_DB.*" (By default, no patterns are defined; you can enter one pattern, multiple patterns, or no patterns.)

To remove an expression, click on its X icon.
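
To see how a pattern will behave before running a crawl, you can try it against a few sample URLs. The following Python sketch is only an approximation of the filtering described above. The connector uses Java regular expressions, but simple patterns such as ".*ROOT_DB.*" behave the same way in Python's re module. The function name should_index is hypothetical, and three behaviors are assumptions rather than documented facts: a pattern must match the whole URL, exclude patterns take precedence over include patterns, and an empty include list accepts everything.

```python
import re


def should_index(url, include_patterns=(), exclude_patterns=()):
    """Decide whether a URL would be indexed under the assumed rules."""
    # Assumption: an exclude match always wins.
    if any(re.fullmatch(pattern, url) for pattern in exclude_patterns):
        return False
    # Assumption: when include patterns exist, at least one must match.
    if include_patterns:
        return any(re.fullmatch(pattern, url) for pattern in include_patterns)
    # Assumption: with no include patterns, everything else is accepted.
    return True


if __name__ == "__main__":
    exclude = [r".*ROOT_DB.*"]
    samples = [
        "http://server/eRoom/facility/room/ROOT_DB/child_item",
        "http://server/eRoom/facility/room/Documents/report.doc",
    ]
    for sample in samples:
        print(sample, should_index(sample, exclude_patterns=exclude))
```

The sample URLs are made up for illustration. With the exclude pattern from the example, the first URL is rejected because it contains ROOT_DB and the second is accepted.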

Specify Workflow Information

Workflow Configuration Tab

In the "Workflow" tab, specify the workflow steps for the jobs that come out of the crawl.

Drag and drop rules to determine which steps an item should follow after being crawled. These rules could specify where to publish the document or which transformations to apply to the data before sending it to a search engine.

After completing these steps, click on the Save button and you will be sent back to the Home page.

Initiate the Full Crawl

Start Crawl

Now that the content source is set up, the crawl can be initiated.

The content source offers several crawl types:
1. "Full" - Click on the button to start a full crawl.
2. "Incremental" - The button will start an incremental crawl. The first time, this works like a full crawl. After the first crawl, use the Incremental button to crawl for any changes made in the repository.
3. "Real Time Crawl" - Use the button with Staging Repository configuration.
4. "Test" - Use the button to perform a test crawl.

During the Crawl

Crawl Statistics

During the crawl, you can do the following:

Click on the "Refresh" button on the Content Sources page to view the latest status of the crawl.
The status will show RUNNING while the crawl is going, and CRAWLED when it is finished.
Click on "Complete" to view the number of documents crawled so far, the number of documents submitted, and the number of documents with errors.

If there are errors, you will get a clickable "Error" flag that will take you to a detailed error message page.

Advanced Properties (Optional)

This section shows how to configure the advanced properties of the connector. You can also check Connector Properties for more information.

Advanced Properties

Group Expansion

Group expansion is configured in the "Advanced Connector Properties" section of the Connector tab.

Click on the Advanced Configuration checkbox to enable the advanced properties section.
Scroll down to Group Expansion and click the checkbox.
Set the URL for the expansion.
Set the user name and password of the crawl account.
Set a schedule for group expansion refresh and cleanup.
Optionally, click on the "Use external Group Expansion" checkbox to select an LDAP Cache component for LDAP group expansion. For more information on the LDAP Cache component, see LDAP Cache.

Note:
     To get the correct information stored in the Group Expansion Cache, you need to perform a full crawl. The full crawl collects the special 
     ACLs that are also stored in the Group Expansion Cache, because for eRoom the connector generates ACLs from the intersection of the 
     parent and child ACLs, not only from the groups or users.
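
The idea of an intersection ACL can be illustrated with a deliberately simplified model. The Python sketch below is not the connector's implementation. It assumes that each ACL has already been expanded into a plain set of user names, and that a user can read an item only if that user appears in the ACL of the item and of every parent above it. The function name effective_readers and the sample names are hypothetical.

```python
def effective_readers(acl_chain):
    """Intersect the ACLs from the top-level container down to the item.

    acl_chain is a list of sets of user names, ordered from the root
    container to the item itself (simplified, hypothetical model).
    """
    if not acl_chain:
        return set()
    allowed = set(acl_chain[0])
    for acl in acl_chain[1:]:
        # A user must be allowed at every level to keep access.
        allowed &= set(acl)
    return allowed


if __name__ == "__main__":
    room_acl = {"alice", "bob", "carol"}
    folder_acl = {"alice", "bob"}
    item_acl = {"bob", "carol"}
    print(sorted(effective_readers([room_acl, folder_acl, item_acl])))  # ['bob']
```

In this model only bob can read the item, because carol is missing from the folder ACL and alice is missing from the item ACL. This is why the parent hierarchy matters when ACLs are generated.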

Use SSL

SSL is configured in the "Advanced Connector Properties" section of the Connector tab.

Click on the Advanced Configuration checkbox to enable the advanced properties section.
Scroll down to Use SSL and click the checkbox.
Set the path to the keystore file for the site (the full path, including the file name and extension).
Set the password of the keystore.

For more information on how to get the keystore from the site, see Crawling Eroom over HTTPS.
