Step 1: Launch Aspire and open the Content Source Management Page



Aspire Content Source Management Page

Launch Aspire (if it's not already running), then browse to: http://localhost:50505. For details on using the Aspire Content Source Management page, please refer to the UI Introduction.
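
If you want to confirm from a script that Aspire is up before continuing, a minimal check might look like the sketch below. It assumes the default URL from this tutorial; adjust it if your installation differs.

    # Quick reachability check for the Aspire admin UI; the URL is the
    # default from this tutorial -- adjust if your installation differs.
    from urllib.request import urlopen

    with urlopen("http://localhost:50505", timeout=5) as resp:
        print("Aspire UI responded with HTTP status", resp.getcode())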



Step 2: Add the Semantic Co-occurrence Solution



Add new source

To specify exactly what data to process, we will need to create a new "Content Source".

To create a new content source:

  1. From the Aspire 2 Home page, click the "Add Source" button.
  2. Click on "Semantic Co-occurrence".

Step 2a: Specify Basic Information



General Configuration Tab

In the "General" tab in the Add New Content Source window, specify basic information for the content source:

  1. Enter a content source name in the "Name" field.

    This can be any name that usefully identifies the source. It will be displayed on the content source page, in error messages, etc.

  2. Click on the "Active?" checkbox to add a checkmark.

    Unchecking the "Active?" option allows you to configure a content source without enabling it. This is useful if the data source will be under maintenance and no crawls should run during that period of time.

  3. Click on the "Schedule" drop-down list and select one of the following: Manually, Periodically, Daily, Weekly, or Advanced.

    Aspire can automatically schedule content sources to be crawled on a set schedule, such as once a day, several times a week, or periodically (every N minutes or hours). For the purposes of this tutorial, you may want to select Manually and then set up a regular crawling schedule later.

  4. After selecting a Schedule type, specify the details, if applicable:
    • Manually: No additional options.
    • Periodically: Specify the "Run every:" options by entering the number of "hours" and "minutes."
    • Daily: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options.
    • Weekly: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options, then clicking on the day checkboxes to specify days of the week to run the crawl.
    • Advanced: Enter a custom cron expression (e.g. 0 0 0 ? * *); a few example expressions follow this list.
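
The example expression above appears to follow the Quartz cron layout (seconds, minutes, hours, day-of-month, month, day-of-week). On that assumption, a few illustrative expressions:

    0 0 0 ? * *         # every day at midnight (the example above)
    0 0/30 * ? * *      # every 30 minutes
    0 0 2 ? * MON-FRI   # 2:00 AM, Monday through Friday
    0 0 22 ? * SAT      # 10:00 PM on Saturdays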


Step 2b: Specify the Solution Configuration



Connector Configuration Tab

In the "Connector" tab, specify the connection information for your semantic co-occurrence solution.

  1. Enter the HDFS NameNode location. For Cloudera, the default port is 8020, e.g. hdfs://name-node-server:8020. (A connectivity sketch covering the HDFS and Redis settings follows this list.)
  2. Enter the path to a local copy of the Hadoop configuration folder. For Cloudera, this is found by default at /etc/hadoop/conf.
  3. Enter the path to the HDFS input folder. AspireInputFormat is the required data format (Text -> AspireObjectWritable).
  4. Enter the path to the HDFS output folder.
  5. Select the Overwrite checkbox if you want to overwrite the output HDFS folders.
  6. Select the checkbox if you want to remove non-English characters from the data.
  7. Fields to process: Enter the fields to process, each with its associated language. These fields will be extracted from the AspireObject.
  8. Enter the local path to the Aspire for Hadoop distribution to use. See Configuring the Aspire for Hadoop Distribution.
  9. Enter the number of reducers the token-processing job will run with.
  10. Enter the hostname where the Redis server is running.
  11. Enter the port on which the Redis server is running.
  12. Enter the name of the database that holds the master dictionary phrase entries.
  13. Enter the Redis database that stores the mappings of the database names; usually 0. If -1, the Redis hashmap databases will be used instead.
  14. Enter the Redis connection timeout, in milliseconds.
  15. Enter the number of queries to post to the Redis database at a time.
  16. Enter the folder where the data to be copied to HDFS is located.
  17. Enter the minimum weight threshold for a phrase to be accepted.
  18. Enter the maximum number of tokens in the phrases to evaluate.
  19. Enter the minimum number of documents that must contain a phrase.
  20. Enter how many of the most statistically relevant phrases to consider.
  21. Enter the minimum number of documents a token pair must occur in.
  22. Enter the minimum number of co-occurrences required for a token to be taken into account for a pair.
  23. Enter the maximum number of related tokens to generate per token.
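
Before saving, it can help to verify that the HDFS and Redis settings above are reachable. The sketch below is not part of Aspire itself; it is a minimal pre-flight check that assumes the hadoop CLI is on the PATH and the redis-py client is installed, and every hostname, port, and path in it is a placeholder to replace with your own values.

    # Hypothetical pre-flight check for the connector settings above.
    # Requires the hadoop CLI and redis-py (pip install redis).
    import subprocess
    import redis

    NAMENODE = "hdfs://name-node-server:8020"    # item 1 (placeholder)
    INPUT_DIR = "/user/aspire/input"             # item 3 (placeholder)
    REDIS_HOST, REDIS_PORT, REDIS_DB = "redis-server", 6379, 0  # items 10-13
    TIMEOUT_MS = 2000                            # item 14

    # hadoop fs -test -d exits with code 0 if the directory exists.
    result = subprocess.run(
        ["hadoop", "fs", "-test", "-d", NAMENODE + INPUT_DIR],
        capture_output=True,
    )
    print("HDFS input folder found:", result.returncode == 0)

    # redis-py takes its timeout in seconds, while Aspire asks for
    # milliseconds, hence the division below.
    r = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, db=REDIS_DB,
                    socket_timeout=TIMEOUT_MS / 1000)
    print("Redis reachable:", r.ping())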

After completing these steps, click the Save button and you will be returned to the Home Page.


Step 3: Initiate the process



Start Crawl

Now that the content source is set up, the crawl can be initiated.

  1. Click the Start button to begin processing the data on the Hadoop cluster.


During the Crawl


Crawl Statistics

