Crawling Socialcast Tutorial (Aspire 2)

This tutorial walks through the steps necessary to crawl a Socialcast repository using the Socialcast connector. You may wish to read through the instructions first so that you can gather the information needed to enter the crawling parameters.

Before Beginning: Add/Create User Account

A prerequisite for crawling Socialcast is to have a Socialcast account. This account must have sufficient permissions to crawl documents for indexing, so it needs access to all the groups (including private groups) that you want to crawl. You may use an existing user account or create a new one. If you are creating a new one, the recommended name for this account is "aspire_crawl_account" or something similar.

The username and password for this account will be required below.

Step 1: Set Socialcast Access Rights

The "aspire_crawl_account" will need to have sufficient access rights to read all of the documents in Socialcast that you wish to crawl.

To set the rights for your "aspire_crawl_account", do the following:

Log in to Socialcast as an administrator.
Click on the Settings icon next to your user name (at the top right) and select admin settings.
In the left-hand menu, click on 'User Management' under 'Community Management'.
Select "aspire_crawl_account" (assuming you have already set up this user).
Click on the Actions icon for "aspire_crawl_account" and click 'Edit'.
Under User type, click on Member.
To make this user an Admin, check the Admin checkbox.
This user is now an Admin and has Full Control (so that it has access to all Socialcast content).
Add the newly created admin user account to all the groups that you want to crawl.

You may skip steps 1 to 6 if you want to add an existing account. You will need this login information later in these procedures, when entering properties for your Socialcast Connector.

Step 2: Install the Socialcast Connector App into Aspire 2

To specify exactly which Socialcast repository to crawl, we will need to create a new "Content Source".

Aspire 2.0 Home Page

Launch Aspire (if it's not already running). See: Launching Aspire
Browse to: http://localhost:50505. For details on using the Aspire Content Source Management page, please refer to UI Introduction.
From the Aspire Admin page (http://localhost:50505/aspire/files/home.html), click on the "[Add Source]" button and select the Socialcast connector from the list.
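
If the Admin page does not load in the browser, a quick reachability check can help tell a stopped Aspire instance apart from a browser problem. The following is a minimal sketch, assuming Aspire runs on the default host and port used in this tutorial; the helper names admin_page_url and is_reachable are hypothetical and are not part of Aspire.

```python
from urllib.error import URLError
from urllib.request import urlopen


def admin_page_url(host: str = "localhost", port: int = 50505) -> str:
    """Build the Aspire Admin page URL used in this tutorial."""
    return f"http://{host}:{port}/aspire/files/home.html"


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds (2xx or 3xx status)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


if __name__ == "__main__":
    url = admin_page_url()
    print(url, "is reachable" if is_reachable(url) else "is not reachable")
```

If the check reports that the page is not reachable, confirm that Aspire has been launched and that the host and port match your installation.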

Step 2a: Specify Basic Information

In the "General" tab in the Add New Content Source window, specify basic information for the content source:

Enter a content source name in the "Name" field.
This is any useful name which you decide is a good name for the source. It will be displayed in the content source page, in error messages, etc.
Click on the "Active?" checkbox to add a checkmark.
Unchecking the "Active?" option allows you to configure content sources but not have them enabled. This is useful if the source is unavailable.
Click on the "Schedule" drop-down list and select one of the following: Manually, Periodically, Daily, Weekly or Advanced.
This can automatically schedule content sources to be crawled on a set schedule, such as once a day, several times a week, or periodically (every N minutes or hours). For the purposes of this tutorial, you may want to select Manually and then set up a regular crawling schedule later.
After selecting a Schedule type, specify the details, if applicable:
Manually: No additional options.
Periodically: Specify the "Run job every:" options by entering the number of "hours" and "minutes."
Daily: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options.
Weekly: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options, then click on the day checkboxes to specify the days of the week to run the crawl.
Advanced: Enter a custom CRON Expression (e.g. 0 0 0 ? * *).
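
The example CRON expression has six fields and uses "?", which suggests a Quartz-style layout (seconds first) rather than the five-field Unix layout. That is an assumption, not something this tutorial states, so verify it against your Aspire version. Under that assumption, the sketch below labels each field of an expression so that it is easier to read; label_cron_fields is a hypothetical helper, not an Aspire API.

```python
# Field order assumed from the Quartz cron layout; the seventh field (year) is optional.
QUARTZ_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
]


def label_cron_fields(expression: str) -> dict:
    """Split a Quartz-style cron expression and pair each part with its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    for name, value in label_cron_fields("0 0 0 ? * *").items():
        print(f"{name:>12}: {value}")
```

Read this way, the example 0 0 0 ? * * means second 0, minute 0, hour 0, on any day of any month, which is once a day at midnight.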

Step 2b: Specify connector properties

Connector properties

In the "Community URL" field in the 'Content Source Properties' section, enter the Socialcast Community URL to crawl.
Must include the respective protocol, http or https.
Enter the Username and Password of the crawl account you created earlier; it needs sufficient access to crawl the Socialcast Community that you specified.
Note: The password will be automatically encrypted by Aspire.
Specify the number of elements per page in the "Page Size" field, or leave it set to the default of 100 (maximum number).
Include/Exclude patterns:
If you want to specify include patterns, click on the 'add new' button for include patterns and enter the regex pattern. Aspire will then crawl only URLs that match the specified pattern.
If you want to specify exclude patterns, click on the 'add new' button for exclude patterns and enter the regex pattern. Aspire will then skip URLs that match the specified pattern.
For Socialcast attachments, you can use include or exclude patterns with the attachment file name.
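
To test patterns before saving them, the include/exclude logic can be approximated outside Aspire. The sketch below is an illustration only: it assumes that an empty include list means "crawl everything", that exclude patterns take precedence, and that a pattern may match anywhere in the URL. Aspire's exact matching rules may differ, and the connector uses its own regex engine rather than Python's re module. The function should_crawl and the sample URLs are hypothetical.

```python
import re


def should_crawl(url: str, include_patterns: list, exclude_patterns: list) -> bool:
    """Approximate include/exclude filtering for a single URL.

    Assumptions (not confirmed by the Aspire documentation):
    - no include patterns means every URL is a candidate;
    - a URL matching any exclude pattern is always skipped.
    """
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    return not any(re.search(p, url) for p in exclude_patterns)


if __name__ == "__main__":
    include = [r"\.(pdf|docx)$"]   # only crawl these attachment types
    exclude = [r"draft"]           # skip anything with "draft" in the name
    for url in [
        "https://example.socialcast.com/attachments/report.pdf",
        "https://example.socialcast.com/attachments/draft-report.pdf",
        "https://example.socialcast.com/attachments/photo.png",
    ]:
        print(url, "->", "crawl" if should_crawl(url, include, exclude) else "skip")
```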

Step 2c: Specify Workflow Information

Aspire 2.0 Workflow Information

In the "Workflow" tab, specify the workflow steps for the jobs that come out of the crawl.

Drag and drop rules to determine which steps an item should follow after being crawled. These rules could specify where to publish the document or which transformations are needed on the data before sending it to a search engine. See Workflow for more information.

After completing these steps, click on the Save button and you'll be sent back to the Home Page.

Step 3: Initiate the Full Crawl

Now that everything is set up, actually initiating the crawl is easy.

From the Aspire Admin page (http://localhost:50505/aspire/files/home.html), you can see the new Socialcast connector.
Make sure the connector is active and the crawl type is full (if it shows incremental or test, click on the crawl type link until it displays full).
A Content Source has the crawl type options Full, Incremental and Test.
Then click on the Start button. A pop-up will ask you to confirm the removal of already indexed data; click OK to accept. Initiating a Full Crawl will remove any previously-indexed information (incremental snapshots) and re-index the source from scratch.
The connector will now crawl the whole Socialcast repository. It may take a minute or two for Aspire to connect to the content source and begin feeding the content.

Note that the connector will be started automatically by the scheduler based on the schedule you specified for it, be it once a day, once a week, every hour, etc. You can also start a crawl at any time by clicking on the "Start" button.

Aspire 2.0 running content source

During the Crawl

During the crawl, you can do the following:

Click on the "Refresh" button on the Content Sources page to view the latest status of the crawl.
The status will show RUNNING while the crawl is going, and CRAWLED when it is finished.
Click on the "Complete" link on your connector to view the number of documents crawled so far, the number of documents submitted, and the number of documents with errors.

Aspire 2.0 content source statistics

If there are errors, you will get a clickable "Error" flag that will take you to a detailed error message page.

Step 4: Initiate an Incremental Crawl

If you only want to process content updates from Socialcast (documents which are added, modified, or removed), then click on the "Run" button when the crawl type is set to "incremental". The Socialcast connector will automatically identify only the changes which have occurred since the last crawl.

If this is the first time that the connector has crawled, an "incremental" crawl does the same thing as a "Full" crawl: both will crawl the entire content source and submit all documents. Thereafter, clicking on the "Run" button when the crawl type is "incremental" will only crawl updates.

Scheduled crawls are always "Incremental" crawls. This means that the first scheduled job will perform a "Full" crawl, and jobs after that will perform "update" crawls. Statistics are reset for every crawl.
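
The difference between the first and later incremental crawls can be pictured as a comparison of two snapshots. The sketch below is a conceptual illustration only, not Aspire's actual implementation: it assumes each snapshot maps an item ID to a signature (for example a modification timestamp or content hash), and diff_snapshots is a hypothetical helper.

```python
def diff_snapshots(previous: dict, current: dict) -> dict:
    """Classify items as added, modified, or removed between two snapshots.

    Each snapshot maps an item ID to a signature that changes when the
    item changes (an assumption made for this illustration).
    """
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    modified = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"added": added, "modified": modified, "removed": removed}


if __name__ == "__main__":
    first = {"msg-1": "v1", "msg-2": "v1"}
    second = {"msg-1": "v2", "msg-3": "v1"}
    # First crawl: no previous snapshot, so every item counts as added.
    print(diff_snapshots({}, first))
    # Later crawl: only the differences are reported.
    print(diff_snapshots(first, second))
```

With an empty previous snapshot every item is reported as added, which mirrors why the first incremental crawl behaves like a full crawl.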


Group Expansion

Group expansion is configured in the "Advanced Connector Properties" section of the Connector tab.

Click on the Advanced Configuration checkbox to enable the advanced properties section.
Scroll down to Group Expansion and click the checkbox.
Set the Socialcast URL, user name and password of the Socialcast repository.
Set a schedule for group expansion refresh and cleanup.

Limitations

The crawling account must have access to all the groups (including private groups) that you want to crawl. With default settings, even an administrator cannot access a private group unless they are a member of it. However, a company can request community administrator access to private groups; see the "Accessing Private Groups" section in the Socialcast documentation. As a workaround, the crawling user can be added as a member to every group in the Socialcast site.
If the crawling user does not belong to a conversation, it cannot access the content of that conversation, so the conversation cannot be crawled. See the "Private messages" section in the Socialcast documentation.
Socialcast has a user type called "External contributors", who have limited access to site content. The Socialcast connector doesn't support external contributors, and we assume search is only available to internal members.
