This tutorial walks through the steps necessary to crawl a Salesforce repository using the Salesforce connector. You may wish to read through the instructions first so that you have all the information needed to configure the content source with the Salesforce connector.


Before Beginning: Add/Create User Account

A prerequisite for crawling Salesforce is to have a Salesforce account. This account must have sufficient permissions to read all of the documents you want to index. You may use an existing user account or create a new one. If you create a new one, the recommended name for this account is "aspire_crawl_account" or something similar.

The username and password for this account will be required below.

Step 1: Set Salesforce Access Rights

The "aspire_crawl_account" will need to have sufficient access rights to read all of the documents in Salesforce that you wish to crawl.

See the prerequisites section for more details.
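
Before configuring the connector, you may want to confirm that the crawl account can actually authenticate and read data. The sketch below is an optional check only, not part of Aspire; it uses the third-party simple-salesforce Python library, and the username shown is hypothetical.

    # A quick, optional check (not part of Aspire) that the crawl account
    # can authenticate and read data, using the third-party
    # simple-salesforce Python library.
    from simple_salesforce import Salesforce

    sf = Salesforce(
        username="aspire_crawl_account@example.com",   # hypothetical account
        password="your-password",
        security_token="your-security-token",          # emailed by Salesforce
    )

    # If login succeeded, a trivial SOQL query should return rows.
    result = sf.query("SELECT Id, Name FROM Account LIMIT 5")
    print(f"Authenticated; query returned {result['totalSize']} account(s)")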

Step 2: Install the Salesforce Connector App into Aspire 2

To specify exactly what Salesforce repository to crawl, we will need to create a new "Content Source".

Aspire 2.0 Home Page

  1. Launch Aspire (if it's not already running). See: Launching Aspire.
  2. Browse to: http://localhost:50505. For details on using the Aspire Content Source Management page, please refer to the UI Introduction.
  3. From the Aspire Admin page (http://localhost:50505/aspire/files/home.html), click the "[Add Source]" button and select the Salesforce connector from the list.

Step 2a: Specify Basic Information

In the "General" tab in the Add New Content Source window, specify basic information for the content source:

  1. Enter a content source name in the "Name" field.

    This can be any name you find useful for identifying the source. It will be displayed on the content source page, in error messages, and so on.

  2. Click on the "Active?" checkbox to add a checkmark.

    Unchecking the "Active?" option lets you configure a content source without enabling it. This is useful if the source is unavailable.

  3. Click on the "Schedule" drop-down list and select one of the following: Manually, Periodically, Daily, Weekly or Advanced.

    This can automatically schedule content sources to be crawled on a set schedule, such as once a day, several times a week, or periodically (every N minutes or hours). For the purposes of this tutorial, you may want to select Manually and then set up a regular crawling schedule later.

  4. After selecting a Schedule type, specify the details, if applicable:

    Manually: No additional options.
    Periodically: Specify the "Run job every:" options by entering the number of hours and minutes.
    Daily: Specify the "Start time:" by selecting the hours and minutes from the drop-down lists.
    Weekly: Specify the "Start time:" by selecting the hours and minutes from the drop-down lists, then click the checkboxes for the days of the week on which to run the crawl.
    Advanced: Enter a custom CRON expression (e.g. 0 0 0 ? * *).
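
To make the schedule semantics concrete, here is a minimal Python sketch of how the "Periodically" and "Daily" options translate into next-run times. This is an illustration only; Aspire's scheduler computes this for you, and nothing below needs to be written or deployed.

    # Illustration of schedule semantics only; Aspire's scheduler does
    # this internally and none of this code is part of the product.
    from datetime import datetime, timedelta

    def next_periodic_run(last_run, hours, minutes):
        # "Periodically": run the job every N hours and M minutes.
        return last_run + timedelta(hours=hours, minutes=minutes)

    def next_daily_run(now, hour, minute):
        # "Daily": run at a fixed start time, rolling over to tomorrow
        # once today's start time has already passed.
        run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        return run if run > now else run + timedelta(days=1)

    print(next_periodic_run(datetime.now(), hours=2, minutes=30))
    print(next_daily_run(datetime.now(), hour=1, minute=0))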

Step 2b: Specify Connector Properties



Connector properties

  1. In the "Salesforce URL" field in the 'Content Source Properties' section, enter the Salesforce URL to crawl.

    Be sure to include the protocol: http or https.

  2. Enter the Username, Password and Security Token of the crawl account you created earlier; it needs sufficient access to read all the documents in the Salesforce repository that you specified.

    Note: The password will be automatically encrypted by Aspire.

  3. Specify the number of elements to retrieve per request to the server in the "Page Size" field, or leave it set to the default of 500.
  4. Specify the time (in seconds) before a connection attempt times out in the "Connection Timeout" field, or leave it set to the default of 5.
  5. Specify the number of attempts to make before the connection reports an error in the "Connection Retries" field, or leave it set to the default of 3. (The timeout and retry behavior is illustrated in the sketch after this list.)
  6. Specify the location of the query file for Salesforce repository items in the "Salesforce Queries File" field, or leave it set to the default. (See sQueries File.)
  7. Specify the directory where the MapDB files for the ACLs and the hierarchy will be placed in the "MapDb's Directory" field, or leave it set to the default.
  8. Specify the directory where the timestamp file will be placed in the "Timestamp Directory" field, or leave it set to the default.
  9. Check the "Fetch Attachments" check box if you want to fetch all available attachments.
  10. Specify the Salesforce API version.
  11. Check the "Use Own Enterprise Jar" check box if you want to load your own enterprise JAR in order to crawl custom fields and objects. (See API JAR.)
  12. Check the "Index Specific Types" check box if you want to specify which types of items to crawl. If you don't select this option, all supported Salesforce item types will be crawled.
  13. Check the "Crawl Chatter Feeds?" check box if you want to crawl the Chatter feeds for each user. (See Prerequisites.)

    Specify the Consumer Key.
    Specify the Consumer Secret.
    Specify the Chatter Page Size.
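
The timeout and retry settings work together: a request that takes longer than the configured timeout counts as a failed attempt, and a failed attempt is retried up to the configured number of times before the crawl reports an error. The Python sketch below illustrates this behavior conceptually; it is not the connector's implementation, and the URL is a placeholder.

    # Conceptual illustration of "Connection Timeout" and "Connection
    # Retries" (defaults shown); this is not Aspire's actual code.
    import requests

    CONNECTION_TIMEOUT = 5   # seconds (connector default)
    CONNECTION_RETRIES = 3   # attempts (connector default)
    PAGE_SIZE = 500          # elements per request (connector default)

    def fetch_with_retries(url, params):
        last_error = None
        for attempt in range(1, CONNECTION_RETRIES + 1):
            try:
                # Each attempt is bounded by the connection timeout.
                return requests.get(url, params=params,
                                    timeout=CONNECTION_TIMEOUT)
            except requests.RequestException as err:
                last_error = err
                print(f"Attempt {attempt} failed: {err}")
        raise ConnectionError(
            f"Giving up after {CONNECTION_RETRIES} attempts") from last_error

    # Placeholder URL; the real connector pages through results
    # PAGE_SIZE elements at a time.
    response = fetch_with_retries("https://example.my.salesforce.com/",
                                  params={})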



Knowledge Articles properties

  1. Check the "Fetch Knowledge Articles" check box if you want to crawl the Knowledge Articles (KA) section. (You must enable the Knowledge Base functionality on your Salesforce server.)

    Select the KA type: Draft, Published, or Archived. If you want only the standard fields, select "Generic"; this retrieves just the standard fields of every article. If you want the custom fields, select "Specific", add the names of the article types, and change the sQueries file to add a query with the custom fields for each new type. (See sQueries File.)

  2. Include/Exclude patterns:

    To specify include patterns, click the 'add new' button for include patterns and enter a regex pattern; Aspire will then crawl only URLs that match one of the include patterns. To specify exclude patterns, click the 'add new' button for exclude patterns and enter a regex pattern; Aspire will skip any URL that matches an exclude pattern. (See the sketch below for how the patterns are applied.)
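
As an illustration of how such patterns are typically applied, the Python sketch below checks a URL against include and exclude regexes: the URL must match at least one include pattern (when any are configured) and no exclude pattern. The patterns and URLs are examples only, not defaults.

    # Illustration of include/exclude pattern semantics; example data only.
    import re

    include_patterns = [r".*/Account/.*"]   # crawl only Account URLs
    exclude_patterns = [r".*\.tmp$"]        # skip temporary files

    def should_crawl(url):
        # With include patterns configured, the URL must match one of them.
        if include_patterns and not any(re.match(p, url)
                                        for p in include_patterns):
            return False
        # Any match against an exclude pattern removes the URL.
        return not any(re.match(p, url) for p in exclude_patterns)

    print(should_crawl("https://example.my.salesforce.com/Account/001xyz"))   # True
    print(should_crawl("https://example.my.salesforce.com/Account/file.tmp")) # False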

Step 2c: Specify Workflow Information



Aspire 2.0 Workflow Information

In the "Workflow" tab, specify the workflow steps for the jobs that come out of the crawl.

  1. Drag and drop rules to determine which steps an item should follow after being crawled. These rules can specify where to publish the document, or transformations to apply to the data before sending it to a search engine. See Workflow for more information.

After completing these steps, click on the Save button and you'll be returned to the Home Page.

Step 3: Initiate the Full Crawl

Now that everything is set up, actually initiating the crawl is easy.

  1. From the Aspire Admin page (http://localhost:50505/aspire/files/home.html), you can see the new Salesforce connector.
  2. Make sure the connector is active and the crawl type is full (if the crawl type shows incremental or test, click the crawl type link until it displays full).

    Content sources have the crawl type options Full, Incremental, and Test.

  3. Click on the Start button. A pop-up will ask you to confirm the removal of already-indexed data; click OK to accept. Initiating a full crawl removes any previously indexed information (incremental snapshots) and re-indexes the source from scratch.
  4. The connector will now crawl the whole Salesforce repository (according to your configuration). It may take a minute or two for Aspire to connect to the content source and begin feeding the content.

Note that the connector will be started automatically by the scheduler based on the schedule you specified, be it once a day, once a week, every hour, etc. You can also start a crawl at any time by clicking on the "Start" button.
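
As context for the full/incremental distinction: a full crawl re-indexes every item, while an incremental crawl uses the timestamp file (kept in the "Timestamp Directory" configured in Step 2b) to fetch only items changed since the previous run. The sketch below illustrates the idea with a SOQL time filter; it is a conceptual example, not the connector's actual query.

    # Conceptual illustration of full vs. incremental crawls; not the
    # connector's actual logic. LastModifiedDate is a standard Salesforce
    # field, but the query shape here is only an example.
    from datetime import datetime, timezone
    from typing import Optional

    def build_query(last_crawl: Optional[datetime]) -> str:
        base = "SELECT Id, Name, LastModifiedDate FROM Account"
        if last_crawl is None:
            return base  # full crawl: no time filter, read everything
        # Incremental crawl: only items changed since the stored timestamp.
        stamp = last_crawl.strftime("%Y-%m-%dT%H:%M:%SZ")
        return f"{base} WHERE LastModifiedDate > {stamp}"

    print(build_query(None))                                      # full
    print(build_query(datetime(2015, 6, 1, tzinfo=timezone.utc))) # incremental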

Aspire 2.0 running content source

During the Crawl

During the crawl, you can do the following:

  1. Click on the "Refresh" button on the Content Sources page to view the latest status of the crawl.

    The status will show RUNNING while the crawl is in progress, and CRAWLED when it has finished.

  2. Click the "Complete" link on your connector to view the number of documents crawled so far, the number of documents submitted, and the number of documents with errors.



Aspire 2.0 content source statistics

If there are errors, you will get a clickable "Error" flag that will take you to a detailed error message page.

Group Expansion

Group expansion configuration is done on the "Advanced Connector Properties" of the Connector tab.

  1. Click on the Advanced Configuration checkbox to enable the advanced properties section.
  2. Scroll down to Group Expansion and click the checkbox.
  3. Set the Salesforce URL, username, password, and any other required properties of the Salesforce repository.
  4. Set a schedule for group expansion refresh and cleanup.
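
For context, group expansion resolves each user into the complete set of groups the user belongs to, so that search results can be filtered against document ACLs. The Python sketch below shows the general idea with illustrative data; it is not Aspire's implementation.

    # Conceptual illustration of group expansion for security trimming;
    # the users, groups, and ACLs below are made up.
    user_groups = {
        "jsmith": {"Everyone", "Sales", "West-Region"},
        "akumar": {"Everyone", "Engineering"},
    }

    def is_visible(username, document_acl):
        # Expand the user into user + all of the user's groups, then
        # check whether any of those identities appear in the ACL.
        expanded = {username} | user_groups.get(username, set())
        return bool(expanded & document_acl)

    doc_acl = {"Sales", "Marketing"}
    print(is_visible("jsmith", doc_acl))  # True: jsmith is in Sales
    print(is_visible("akumar", doc_acl))  # False: no overlap with the ACL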



Limitations

Security related limitations

  1. For security, only the 'Supported elements' specified on the Introduction page are supported.
  2. For sharing-related incremental crawling, unsharing of a Salesforce item is not detected.
  3. For incremental crawling of Salesforce task items, only tasks based on accounts are supported.

