The Hadoop Reducer Subjob Extractor stage creates a sub job for each AspireObject referenced in the [HadoopIterableWrapper] hadoopIterable job variable received from Hadoop, with the hadoopKey job variable as the associated input key. Each sub job also has access to a further Hadoop variable, hadoopContext (the current Hadoop context, used to write key/value pairs).
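The fan-out described above can be sketched as follows. This is a conceptual illustration only, not Aspire's actual implementation: the `SubJob` record and `extractSubJobs` method are hypothetical stand-ins showing how one (key, iterable-of-values) reduce input becomes one sub job per value, each carrying the key.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch (not Aspire's code): fan out one reduce input into
// one sub job per value, pairing each value with the hadoopKey.
public class SubJobFanOutSketch {
    // Hypothetical stand-in for a sub job: the input key plus one
    // element of the hadoopIterable.
    record SubJob(String hadoopKey, String value) {}

    static List<SubJob> extractSubJobs(String hadoopKey, Iterable<String> hadoopIterable) {
        List<SubJob> subJobs = new ArrayList<>();
        for (String value : hadoopIterable) {
            subJobs.add(new SubJob(hadoopKey, value)); // one sub job per value
        }
        return subJobs;
    }

    public static void main(String[] args) {
        List<SubJob> jobs = extractSubJobs("docId-42", List.of("a", "b"));
        System.out.println(jobs.size() + " sub jobs created for key docId-42");
    }
}
```

In the real stage, each sub job would additionally carry the hadoopContext variable so that pipeline stages can write key/value pairs back to Hadoop.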
Configuration
Element | Type | Default | Description
---|---|---|---
timeout | long | 10 min | The maximum time to wait for sub jobs to complete.
retryCount | int | 1 | The number of retries if a sub job is rejected when the stage tries to enqueue it. A value from 1 to 5 is allowed.
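The retryCount setting can be illustrated with a small sketch. This is not Aspire's implementation; `enqueueWithRetry` is a hypothetical helper that retries a submission when the queue rejects it, bounded by the 1-to-5 range documented above.

```java
import java.util.concurrent.RejectedExecutionException;

// Illustrative sketch (not the Aspire implementation): retry a sub-job
// submission up to retryCount times when the queue rejects it.
public class RetryEnqueueSketch {
    // Runs `submit`; retries on RejectedExecutionException up to retryCount
    // attempts in total. Returns the 1-based attempt that succeeded, or
    // rethrows the last rejection if all attempts fail.
    static int enqueueWithRetry(Runnable submit, int retryCount) {
        if (retryCount < 1 || retryCount > 5) {
            throw new IllegalArgumentException("retryCount must be between 1 and 5");
        }
        RejectedExecutionException last = null;
        for (int attempt = 1; attempt <= retryCount; attempt++) {
            try {
                submit.run();
                return attempt;
            } catch (RejectedExecutionException e) {
                last = e; // queue was full; try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Simulated queue that rejects the first two submissions.
        final int[] calls = {0};
        int attempt = enqueueWithRetry(() -> {
            if (++calls[0] < 3) throw new RejectedExecutionException("queue full");
        }, 4);
        System.out.println("succeeded on attempt " + attempt);
    }
}
```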
Branch Configuration
The Hadoop Reducer Subjob Extractor publishes documents using the branch manager, on the onSubJob event. You must therefore include a <branches> element for this event in the configuration, to publish to a pipeline within a pipeline manager. See Branch Handler for more details.
Element | Type | Description
---|---|---
branches/branch/@event | string | The event to configure. Should always be "onSubJob".
branches/branch/@pipelineManager | string | The URL of the pipeline manager to publish to. May be relative.
branches/branch/@pipeline | string | The name of the pipeline to publish to.
Example Configuration
This section provides an example Hadoop Reducer Subjob Extractor configuration.
<component name="HadoopSubJobs" subType="default" factoryName="aspire-hadoop-subjob-extractor">
  <timeout>60000</timeout>
  <retryCount>4</retryCount>
  <branches>
    <branch event="onSubJob" pipelineManager="../ProcessPipeline" pipeline="doc-pipeline"/>
  </branches>
</component>