Step 2. Add a New Content Source
For this step, follow the Configuration Tutorial of the connector of your choice. See the Connector list for the available connectors.
Step 3. Add a New Publish to CDH-HDFS to the Workflow
To add a Publish to CDH-HDFS rule, drag it from the Workflow Library and drop it into the Workflow Tree at the point where you want to add it. This automatically opens the Publish to CDH-HDFS window, where you configure the publisher.
- Enter the name of the publisher. This name must be unique.
- Enter the description of the publisher that will be shown in the Workflow Tree.
- Select the publishing protocol to use:
  - HDFS (Java API)
  - WebHDFS (REST API)
Not all HDFS clusters have WebHDFS enabled.
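One way to find out whether WebHDFS is reachable is to request a path through the REST endpoint before choosing that protocol. The following is a minimal sketch that only builds the request URL, using the standard WebHDFS layout `http://<host>:<port>/webhdfs/v1/<path>?op=<operation>`. The function name `webhdfs_url`, the host, and the port are assumptions for illustration; the NameNode HTTP port varies by Hadoop version and cluster configuration, so check your own cluster settings.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1<path>?op=<op>."""
    if not path.startswith("/"):
        raise ValueError("WebHDFS paths must be absolute")
    # The operation goes in the query string, followed by any extra parameters.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and port (50070 is an assumption, not a value from this tutorial).
url = webhdfs_url("localhost", 50070, "/user/jsmith/my_aspire_output", "GETFILESTATUS")
print(url)
```

Opening the printed URL with a browser or an HTTP client should return a JSON file status if WebHDFS is enabled and the path exists. If the request is refused or returns an error, the HDFS (Java API) protocol is probably the safer choice.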
3.a. Publish Using HDFS
In the HDFS section of the Publish to CDH-HDFS window, specify the connection information for publishing to HDFS.
- Enter the HDFS URL. Use the hdfs:// protocol and include the port (8020 by default). For example: hdfs://localhost:8020
- Specify the location of the Output key. This is an AXPath to the node inside the AspireObject. For example: /doc/docType
- Specify the absolute HDFS Folder Path where the files will be published. For example: /user/jsmith/my_aspire_output. (The user that runs Aspire must have write access to the HDFS folder.)
- Specify the Max File Size in megabytes. If left as -1, the HDFS block size is used as the file size limit.
- Specify a File Prefix Name. For example, with the prefix aspire-, files will be named aspire-00000, aspire-00001, aspire-00002, etc.
- Debug: Select this check box if you want to run the publisher in debug mode.
- Click Add.
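The settings in this list can be illustrated with a short sketch. It is not the publisher's implementation; it only mirrors the rules described above: the HDFS URL uses the hdfs:// scheme with port 8020 as the default, output files are named with the prefix plus a five-digit counter, and a Max File Size of -1 falls back to the HDFS block size. All function names are hypothetical, and treating one megabyte as 1024 * 1024 bytes is an assumption.

```python
from typing import Tuple
from urllib.parse import urlparse

DEFAULT_HDFS_PORT = 8020  # default port named in the tutorial


def parse_hdfs_url(url: str) -> Tuple[str, int]:
    """Return (host, port) from an hdfs:// URL, defaulting the port to 8020."""
    parsed = urlparse(url)
    if parsed.scheme != "hdfs" or not parsed.hostname:
        raise ValueError(f"expected hdfs://host[:port], got {url!r}")
    return parsed.hostname, parsed.port or DEFAULT_HDFS_PORT


def output_file_name(prefix: str, index: int) -> str:
    """Name a file as the prefix followed by a zero-padded five-digit counter."""
    return f"{prefix}{index:05d}"


def effective_max_bytes(max_file_size_mb: int, block_size_bytes: int) -> int:
    """Resolve the file size limit; -1 means 'use the HDFS block size'."""
    if max_file_size_mb == -1:
        return block_size_bytes
    if max_file_size_mb <= 0:
        raise ValueError("Max File Size must be -1 or a positive number")
    # Assumption: one megabyte is counted as 1024 * 1024 bytes.
    return max_file_size_mb * 1024 * 1024


print(parse_hdfs_url("hdfs://localhost:8020"))
print([output_file_name("aspire-", i) for i in range(3)])
# 134217728 bytes (128 MiB) is used here only as an example block size.
print(effective_max_bytes(-1, 134217728))
```

Checking values this way before entering them in the window can catch a missing scheme or a relative folder path early, since the publisher itself expects an absolute HDFS path and a well-formed URL.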
After you click the Add button, Aspire takes a moment to download the necessary components (the JAR files) from the Maven repository and load them. Once that is done, the publisher appears in the Workflow Tree.
For details on using the Workflow section, please refer to the Workflow introduction.