Step 1. Launch Aspire and Open the Services Management Page


  1. Launch Aspire (if it's not already running).
  2. Go to Launch Control.
  3. Browse to: http://localhost:50505
  4. Select "Services" from the drop-down on the right to display the "Service Management" page.

For details on using the Aspire Service Management page, see Admin UI.



Step 2. Select or Add a Content Source

Select a "Content Source" to work with.

New Service



  1. From Service Management or Content Source, click Add Service or Add Source to add a new service or content source, or select an existing one.


Step 2a. Disable Text Extraction

In the "Connector" tab:

  • Locate the "Extract Text Properties" section
    • Check "Extract Text Configuration"
      • To show the configuration parameters
    • Check "Disable text extraction"
      • To disable the text extraction

If you need text extraction, you will need to add a text extraction stage to the workflow later.


Step 2b. Configure Workflow Information

  1. service.
    1. Or add a custom one with the group id "com.searchtechnologies.aspire" and artifact id of "app-hdfs-binary-compactor"

Step 2a. Configure the Service

In the "Service" tab, configure the parameters:

In the Workflow tab:

Install the binary writer either by dragging it from the Applications section of the workflow configuration or by adding a custom application with the group id “com.searchtechnologies.aspire” and the artifact id “app-hdfs-binary-writer”.

Configure the parameters

  • HDFS Base
    • The base directory on the HDFS file system under which all content will be written
    • Form
      • hdfs://<server>:<port>/<path>
    • Example
  • Get content source from document
    • When checked, the content source name will be automatically extracted from the Aspire document. The field holding the value can be configured below
  • Content source field
    • The field in the Aspire document in which the content source name is held, used when getting the content source from the document
    • Form
      • /path/to/field
    • Example
      • /doc/sourceId
  • Content source
    • The static value of content source name to use when not getting the content source from the document
    • Form
      • name
    • Example
      • myContentSource
  • Use default ID field
    • By default, the id field is taken from the fetchUrl field. If you want to use a different field, uncheck this box.
  • Id field
    • When not using the default field for the document id, enter the field that contains the document id here.
    • Form
      • /path/to/field
    • Example
      • /doc/fetchUrl
  • Hash bits
    • The number of bits of the hash to use for the HDFS directory name. The filename is constructed using the MD5 hash of the document ID and the original filename. The file is then stored in a directory whose name is the least significant n bits of the hash, where n is the number of bits given here.
    • Form
      • <number>
    • Example
      • 16
  • Directory bits
    • The number of bits of the hash to use for grouping the directories.
    • Form
      • <number>
    • Example
      • 8
  • Directory
    • Add one or more content source directories that will be scanned for compaction. These directories will be located below the base given above
    • Form
      • <directory name>
    • Example
      • aspire_filesystem_source
  • Period
    • The period between scans of each content source directory
    • Form
      • <number> <unit>
    • Example
      • 12h
  • Threshold
    • The number of files that must exist in the directory before compaction takes place
  • Suppress deletes
    • Checking this box will prevent binary files that are in an existing HAR file from being deleted when a delete action is encountered. Instead, a “marker file” will be left to indicate the binary was deleted.
  • HDFS Options
    • Security
      • Choose the type of security to use to access the HDFS file system: Kerberos or none
    • User principal
      • The principal user for Kerberos
    • Key tab file
      • The user's key tab file
      • Form
        • File path
      • Example
        • config/myUser.keytab
    • Add resources
      • Check this box if you need to add Hadoop resources to the configuration (such as core-site.xml)
      • Resource file
        • The path to a resource file to add to the configuration
        • Form
          • /path/to/file
        • Example
          • config/core-site.xml
    • Block size
      • The size of block to be used when accessing the HDFS file system
      • Form
        • <number> <unit>
      • Example
        • 32mb
    • Buffer size
      • The size of buffer to be used when accessing the HDFS file system
      • Form
        • <number> <unit>
      • Example
        • 32kb
    • Replication
      • The HDFS replication factor
      • Form
        • <number>
      • Example
        • 3
  • Configure lock
    • Check this box if you want to configure the lock files
    • Tries
      • The number of attempts to get a lock file before moving on
      • Form
        • <number>
      • Example
        • 3
    • Retry sleep
      • The period to wait before attempting to get the lock when a previous attempt has failed
      • Form
        • <number> <unit>
      • Example
        • 15s
    • Expiry
      • The period after which a lock is deemed to have expired and will be released when another process attempts to get it.
      • Form
        • <number> <unit>
      • Example
        • 12h
  • Debug
    • Check this box to enable debug messages

Save the service and it will start automatically.

Step 3: Initiate a Crawl

Now that the HDFS writer is set up, a crawl can be initiated.
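
The lock settings described above (Tries, Retry sleep, Expiry) can be pictured with a minimal sketch. It is not Aspire's implementation: the class LockTable, the function acquire_with_retries, and the in-memory dictionary standing in for lock files on HDFS are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict


class LockTable:
    """Hypothetical in-memory stand-in for lock files: lock name -> creation time."""

    def __init__(self, expiry_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.expiry_seconds = expiry_seconds  # the "Expiry" setting, in seconds
        self.clock = clock
        self.locks: Dict[str, float] = {}

    def try_acquire(self, name: str) -> bool:
        now = self.clock()
        created = self.locks.get(name)
        # An expired lock is released when another process attempts to get it.
        if created is not None and now - created >= self.expiry_seconds:
            del self.locks[name]
            created = None
        if created is None:
            self.locks[name] = now
            return True
        return False


def acquire_with_retries(table: LockTable, name: str, tries: int,
                         retry_sleep_seconds: float,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Attempt the lock up to `tries` times, sleeping between failed attempts."""
    for attempt in range(tries):
        if table.try_acquire(name):
            return True
        if attempt < tries - 1:
            sleep(retry_sleep_seconds)  # the "Retry sleep" setting
    return False  # all attempts used; the caller moves on
```

With Tries set to 3 and Retry sleep set to 15s, a process in this sketch would wait at most 30 seconds in total before giving up on a directory that another process holds.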


When crawling, the content will be written to HDFS.
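
As a rough illustration of the Hash bits setting described earlier, the sketch below derives a directory name from the least significant n bits of an MD5 hash. The function name directory_for, the hexadecimal formatting of the directory name, and hashing the document id alone are assumptions made for illustration; the names the writer actually produces may differ.

```python
import hashlib


def directory_for(doc_id: str, hash_bits: int) -> str:
    """Return a directory name built from the low `hash_bits` bits of MD5(doc_id).

    Hypothetical helper: the hex formatting is an assumption, not Aspire's scheme.
    """
    if not 1 <= hash_bits <= 128:
        raise ValueError("hash_bits must be between 1 and 128")
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    # Keep only the least significant `hash_bits` bits of the 128-bit hash.
    low_bits = value & ((1 << hash_bits) - 1)
    width = (hash_bits + 3) // 4  # number of hex digits needed
    return format(low_bits, f"0{width}x")
```

With 16 hash bits, documents in this sketch would be spread over at most 65,536 directories; a smaller value gives fewer, fuller directories.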