Step 1: Launch Aspire and open the Content Source Management Page



Aspire Content Source Management Page

Launch Aspire (if it's not already running), then browse to: http://localhost:50505. For details on using the Aspire Content Source Management page, please refer to the UI Introduction.
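
If you want to confirm from a script that Aspire is up before continuing, a minimal check might look like the sketch below. It assumes the default URL from this tutorial; adjust it if your installation differs.

    # Quick reachability check for the Aspire admin UI; the URL is the
    # default from this tutorial -- adjust if your installation differs.
    from urllib.request import urlopen

    with urlopen("http://localhost:50505", timeout=5) as resp:
        print("Aspire UI responded with HTTP status", resp.getcode())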



Step 2: Add the Semantic Co-occurrence Solution



Add new source

To specify exactly what data to process, we will need to create a new "Content Source".

To create a new content source:

  1. From the Aspire 2 Home page, click the "Add Source" button.
  2. Click on "Semantic Co-occurrence".

Step 2a: Specify Basic Information



General Configuration Tab

In the "General" tab in the Add New Content Source window, specify basic information for the content source:

  1. Enter a content source name in the "Name" field.

    This can be any name that usefully identifies the source. It will be displayed on the content source page, in error messages, etc.

  2. Click on the "Active?" checkbox to add a checkmark.

    Unchecking the "Active?" option allows you to configure a content source without enabling it. This is useful if the data source will be under maintenance and no crawls should run during that period of time.

  3. Click on the "Schedule" drop-down list and select one of the following: Manually, Periodically, Daily, Weekly, or Advanced.

    Aspire can automatically schedule content sources to be crawled on a set schedule, such as once a day, several times a week, or periodically (every N minutes or hours). For the purposes of this tutorial, you may want to select Manually and then set up a regular crawling schedule later.

  4. After selecting a Schedule type, specify the details, if applicable:
    • Manually: No additional options.
    • Periodically: Specify the "Run every:" options by entering the number of "hours" and "minutes."
    • Daily: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options.
    • Weekly: Specify the "Start time:" by clicking on the hours and minutes drop-down lists and selecting options, then clicking on the day checkboxes to specify days of the week to run the crawl.
    • Advanced: Enter a custom cron expression (e.g. 0 0 0 ? * *); a few example expressions follow this list.
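
The example expression above appears to follow the Quartz cron layout (seconds, minutes, hours, day-of-month, month, day-of-week). On that assumption, a few illustrative expressions:

    0 0 0 ? * *         # every day at midnight (the example above)
    0 0/30 * ? * *      # every 30 minutes
    0 0 2 ? * MON-FRI   # 2:00 AM, Monday through Friday
    0 0 22 ? * SAT      # 10:00 PM on Saturdays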


Step 2b: Specify the Solution Configuration



Connector Configuration Tab

In the "Connector" tab, specify the connection information for your semantic co-occurrence solution.

  1. Enter the HDFS NameNode location. For Cloudera, the default port is 8020, e.g. hdfs://name-node-server:8020. (A connectivity sketch covering the HDFS and Redis settings follows this list.)
  2. Enter the path to a local copy of the Hadoop configuration folder. For Cloudera, this is found by default at /etc/hadoop/conf.
  3. Enter the path to the HDFS input folder. AspireInputFormat is the required data format (Text -> AspireObjectWritable).
  4. Enter the path to the HDFS output folder.
  5. Select the Overwrite checkbox if you want to overwrite the output HDFS folders.
  6. Select the checkbox if you want to remove non-English characters from the data.
  7. Fields to process: Enter the fields to process, each with its associated language. These fields will be extracted from the AspireObject.
  8. Enter the local path to the Aspire for Hadoop distribution to use. See Configuring the Aspire for Hadoop Distribution.
  9. Enter the number of reducers the token-processing job will run with.
  10. Enter the hostname where the Redis server is running.
  11. Enter the port on which the Redis server is running.
  12. Enter the name of the database that holds the master dictionary phrase entries.
  13. Enter the Redis database that stores the mappings of the database names; usually 0. If -1, the Redis hashmap databases will be used instead.
  14. Enter the Redis connection timeout, in milliseconds.
  15. Enter the number of queries to post to the Redis database at a time.
  16. Enter the folder where the data to be copied to HDFS is located.
  17. Enter the minimum weight threshold for a phrase to be accepted.
  18. Enter the maximum number of tokens in the phrases to evaluate.
  19. Enter the minimum number of documents that must contain a phrase.
  20. Enter how many of the most statistically relevant phrases to consider.
  21. Enter the minimum number of documents a token pair must occur in.
  22. Enter the minimum number of co-occurrences required for a token to be taken into account for a pair.
  23. Enter the maximum number of related tokens to generate per token.
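
Before saving, it can help to verify that the HDFS and Redis settings above are reachable. The sketch below is not part of Aspire itself; it is a minimal pre-flight check that assumes the hadoop CLI is on the PATH and the redis-py client is installed, and every hostname, port, and path in it is a placeholder to replace with your own values.

    # Hypothetical pre-flight check for the connector settings above.
    # Requires the hadoop CLI and redis-py (pip install redis).
    import subprocess
    import redis

    NAMENODE = "hdfs://name-node-server:8020"    # item 1 (placeholder)
    INPUT_DIR = "/user/aspire/input"             # item 3 (placeholder)
    REDIS_HOST, REDIS_PORT, REDIS_DB = "redis-server", 6379, 0  # items 10-13
    TIMEOUT_MS = 2000                            # item 14

    # hadoop fs -test -d exits with code 0 if the directory exists.
    result = subprocess.run(
        ["hadoop", "fs", "-test", "-d", NAMENODE + INPUT_DIR],
        capture_output=True,
    )
    print("HDFS input folder found:", result.returncode == 0)

    # redis-py takes its timeout in seconds, while Aspire asks for
    # milliseconds, hence the division below.
    r = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, db=REDIS_DB,
                    socket_timeout=TIMEOUT_MS / 1000)
    print("Redis reachable:", r.ping())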

After completing these steps, click the Save button and you will be returned to the Home Page.


Step 3: Initiate the process



Start Crawl

Now that the content source is set up, the crawl can be initiated.

  1. Click the Start button to begin processing the data on the Hadoop cluster.


During the Crawl


Crawl Statistics

