On this page:

Table of Contents

Step 1. Launch Aspire and Open the

Content Source

Services Management Page

Step 2.

Select or

Add a

Content Source

Select a "Content Source" to work with.

New Service

Step 2a. Disable Text Extraction

Step 2b. Configure Workflow Information

Step 2a. Configure the Service

In the "Service" tab, configure the parameters:

In the Workflow tab:

install the binary writer either by dragging it from the Applications section of the workflow configuration or by adding a custom application with the group id “com.searchtechnologies.aspire” and the artifact id “app-hdfs-binary writer”

Configure the parameters

HDFS Base
- The base directory on the hdfs file system under which all content will be written
- Form
  - hdfs://<server>:<port>/<path>
- Example
  - hdfs://localhost:9000/myBase
Get content Content source from document

When checked, the content source name with be automatically extracted from the Aspire document. The field holding the value can be configured below

Content source field
- The field in the Aspire document in which the content source name is held, used when getting the content source from the document
- Form
  - /path/to/field
- Example
  - /doc/sourceId
Content source
- The static value of content source name to use when not getting the content source from the document
- Form
  - name
- Example
  - myContentSource
Use default ID field
- By default, the id field is taken from the fetchUrl field. If you want to use a different field, uncheck this box.
Id field

When not using the default field for the document id, enter the field that contains the document id here.
Form
- /path/to/field
Example
- /doc/fetchUrl

Hash bits
- The number of bits of the hash to use for the HDFS directory name. The filename is constructed using the MD5 hash of the document ID and the original filename. The file is then stored in a directory whose name is the least significant n bits of the hash, where n is the number of bits given here.
- Form
  - <number>
- Example
  - 16

Directory bits

directory

Add one or more content source directories that will be scanned for compaction. These directories will be located below the base given above
Form
- <directory name>
Example
- aspire_filesystem_source

Period
- The period between scans of each content source directory
- Form
  - <number> <unit>
- Example
  - 12h
Threshold
- The number of files that must exist in the directory before compaction takes place
The number of bits of the hash to use for the grouping the directories.
Form
- <number>
Example
- 8
Suppress deletes

Checking this box will prevent binary files that are in an existing HAR file from being deleted when a delete action is encountered.

Instead a “marker file” will be left to indicate the binary was deleted

HDFS Options
- Security
  - Choose the type of security to use to access the HDFS file system- Kerberos or none
- User principle
  - The principal user for Kerberos
- Key tab file
  - The user's key tab file
  - Form
    - File path
  - Example
    - config/myUser.keytab
- Add resources
  - Check this box if you need to add Hadoop resources to the configuration (such as site-core.xml)
  - Resource file
    - The path to a resource file to add to the configuration
    - Form
      - /path/to/file
    - Example
      - config/core-site.xml
- Block size
  - The size of block to be used when accessing the HDFS file system
  - Form
    - <number> <unit>
  - Example
    - 32mb
- Buffer size
  - The size of buffer to be used when accessing the HDFS file system
  - Form
    - <number> <unit>
  - Example
    - 32kb
- Replication
  - The HDFS replication factor
  - Form
    - <number>
  - Example
    - 3
Debug
Configure lock
- Check this box

to enable debug messages

Image Removed

Step 3: Initiate a Crawl

Now that the HDFS writer is set up, a crawl can be initiated.

- if you want to configure the lock files
- Tries
  - The number of attempts to get a lock file before moving on
  - Form
    - Number
  - Example
    - 3
- Retry sleep
  - The period to wait before attempting to get the lock when a previous attempt has failed
  - Form
    - <number> <unit>
  - Example
    - 15s
- Expiry
Debug
- Check this box to enable debug messages

Save the service and it will start automatically

Image Added

When crawling the content will be written to HDFS

Page tree

Versions Compared

Old Version 1

New Version Current

Key

Step 1. Launch Aspire and Open the

Services Management Page

Step 2.

Add a

New Service

Step 2a. Disable Text Extraction

Step 2b. Configure Workflow Information

Step 2a. Configure the Service

Step 3: Initiate a Crawl

Page tree

Page History

Versions Compared

Old Version 1

New Version Current

Key

Step 1. Launch Aspire and Open the

Services Management Page

Step 2.

Add a

New Service

Step 2a. Disable Text Extraction

Step 2b. Configure Workflow Information

Step 2a. Configure the Service

Step 3: Initiate a Crawl