The Post HDFS stage writes key/value pairs into HDFS. The key is a user-defined field from the job's AspireObject (or the job ID, if no key is defined), and the value is the job's AspireObject. Key/value pairs are written to a single file until a file size threshold is reached. A new file is then created with a sequential ID (e.g., aspire-00000, aspire-00001, aspire-00002, ..., aspire-N).
The stage communicates with HDFS through the HDFS API FileSystem methods.
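
The rollover behavior described above can be sketched in a few lines of Python. This is only an illustration, not the component's implementation: the function names `file_name` and `assign_files` are hypothetical, the size of a pair is approximated as the UTF-8 length of key plus value, and it is an assumption that the size check happens after each write (so a file may slightly exceed the threshold).

```python
def file_name(prefix: str, index: int) -> str:
    """Build a name such as aspire-00000 from a prefix and a sequential id."""
    return f"{prefix}-{index:05d}"


def assign_files(pairs, max_bytes, prefix="aspire"):
    """Group (key, value) string pairs into files, rolling over at max_bytes.

    Assumption: a pair is always written to the current file first and the
    size check happens afterwards, so a file may slightly exceed max_bytes.
    """
    files = {}
    index, used = 0, 0
    for key, value in pairs:
        name = file_name(prefix, index)
        files.setdefault(name, []).append((key, value))
        used += len(key.encode("utf-8")) + len(value.encode("utf-8"))
        if used >= max_bytes:  # threshold reached: the next pair opens a new file
            index, used = index + 1, 0
    return files


# Three 10-byte pairs with a 20-byte threshold
pairs = [("a", "x" * 9), ("b", "x" * 9), ("c", "x" * 9)]
files = assign_files(pairs, max_bytes=20)
```

In the real component the threshold comes from the fileSize parameter and the prefix from filePrefixName.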
Post HDFS | |
---|---|
Factory Name | com.accenture.aspire:aspire-post-hdfs |
subType | default |
Inputs | An AspireObject with the metadata of each document to be posted and a key (optional). |
Outputs | An HDFS file entry consisting of the key and a JSON representation of the AspireObject as the value. |
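
As a rough illustration of how an input document maps to an output entry, the sketch below picks the key and serializes the value. It rests on several assumptions: the AspireObject is modeled as a plain dictionary, the output key is treated as a simple top-level field name rather than a full AXPath, the job ID is used both when no output key is configured and when the field is missing, and the value is wrapped in a `doc` element. The function name `build_entry` is hypothetical.

```python
import json
from typing import Optional, Tuple


def build_entry(doc: dict, job_id: str, output_key: Optional[str] = None) -> Tuple[str, str]:
    """Return (key, value) for one job.

    Assumption: output_key is a top-level field name; real AXPath
    expressions are not evaluated in this sketch.
    """
    key = doc.get(output_key) if output_key else None
    if key is None:
        key = job_id  # fall back to the job id when no key is available
    # Assumption: the JSON value wraps the metadata in a "doc" element
    return str(key), json.dumps({"doc": doc})


with_key = build_entry({"weekDay": "Monday", "name": "jsmith"}, "job-1", "weekDay")
without_key = build_entry({"name": "jsmith"}, "job-2")
```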
This section lists all configuration parameters available to configure the Post HDFS component.
Element | Type | Default | Description |
---|---|---|---|
hdfsUrl | String | hdfs://localhost:8020 | The HDFS Namenode URL. |
folderPath | String | | The path within the HDFS server where the files will be stored. If empty, the user home folder will be used. |
filePrefixName | String | aspire | The prefix of the names of the files that will be stored. Each file name is completed with a sequential counter value (e.g., aspire-00000). |
fileSize | long | HDFS Default Block Size | The maximum size of each file to be created. When this size is reached, a new file is created. |
outputKey | String | | An AXPath of the metadata field to use as the output key. |
ignoreAspireBatch | boolean | true | Whether the component ignores Aspire batches instead of creating a new file for each batch. NOTE: If this is false and Aspire job batching is enabled, the fileSize value is ignored and each file contains exactly as many key/value pairs as the batch size. |
timeout | int | 30000 | Time in milliseconds to wait until the file can be closed, after the last job has been processed. |
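
The interaction between ignoreAspireBatch and fileSize noted in the table can be expressed as a small decision rule. The sketch below is one reading of that note, not the component's code: the function and argument names are hypothetical, and how the component detects the end of a batch is not described here.

```python
def should_start_new_file(
    ignore_aspire_batch: bool,
    batching_enabled: bool,
    batch_finished: bool,
    bytes_written: int,
    file_size: int,
) -> bool:
    """Decide whether the next key/value pair should go to a new file."""
    if not ignore_aspire_batch and batching_enabled:
        # Batch mode: fileSize is ignored, one file per Aspire batch
        return batch_finished
    # Default mode: roll over once the size threshold is reached
    return bytes_written >= file_size
```

With the default `ignoreAspireBatch=true`, only the size threshold matters, regardless of whether job batching is enabled.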
This section provides an example Post HDFS configuration for a local HDFS server.
<component name="PostHDFS" subType="default" factoryName="aspire-post-hdfs">
  <hdfsUrl>hdfs://localhost:8020/</hdfsUrl>
  <folderPath>/user/jsmith/test/</folderPath>
  <filePrefixName>aspire-</filePrefixName>
  <outputKey>weekDay</outputKey>
</component>
Example output entries, each key followed by its JSON value:
Monday
{"doc":{"weekDay":"Monday","name":"jsmith","date":"2013\/07\/16","url":"http:\/\/www.searctechnologies.com\/products\/we-are-great.html"}}
Wednesday
{"doc":{"weekDay":"Wednesday","name":"jsmith","date":"2013\/07\/16","url":"http:\/\/www.searctechnologies.com\/home.html"}}
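
To check entries like these after exporting them from HDFS, a short Python sketch can pair each key with its JSON value. It assumes the entries are available as alternating lines of text, as in the listing above; the actual on-disk format of the HDFS files is not specified here, and `parse_entries` is a hypothetical helper name.

```python
import json


def parse_entries(lines):
    """Pair each key line with the JSON line that follows it."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 2:
        raise ValueError("expected alternating key and JSON lines")
    return [
        (cleaned[i], json.loads(cleaned[i + 1]))
        for i in range(0, len(cleaned), 2)
    ]


# Shortened sample based on the first entry above
sample = [
    "Monday",
    r'{"doc":{"weekDay":"Monday","name":"jsmith","date":"2013\/07\/16"}}',
]
entries = parse_entries(sample)
key, value = entries[0]
```

The escaped slashes (`\/`) are valid JSON and are decoded to plain `/` by the parser, so no extra unescaping step is needed.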