The Hadoop Reducer Subjob Extractor stage creates a sub job for each AspireObject referenced by the hadoopIterable job variable (a HadoopIterableWrapper) that arrives from Hadoop, using the hadoopKey job variable as the associated input key. Each sub job also has access to one further Hadoop variable, hadoopContext: the current Hadoop context to which key/value pairs can be written.
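The data flow described above can be illustrated with a minimal sketch. Note that this uses simplified stand-in types (FakeContext, extractSubJobs), not the actual Aspire or Hadoop APIs; it only shows the pattern of one sub job per value in the reducer's iterable, each seeing the shared key and the shared context:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins (NOT the Aspire API) illustrating the stage's data
// flow: one sub job per value in hadoopIterable, each carrying hadoopKey
// and a shared hadoopContext it can write key/value pairs to.
public class SubJobExtractorSketch {

    // Stand-in for the Hadoop reducer context available to each sub job.
    static class FakeContext {
        final StringBuilder written = new StringBuilder();
        void write(String key, String value) {
            written.append(key).append('=').append(value).append(';');
        }
    }

    // One "sub job" per element of the iterable; in Aspire each value would
    // be an AspireObject handed to the configured onSubJob pipeline.
    static int extractSubJobs(String hadoopKey, List<String> hadoopIterable,
                              FakeContext hadoopContext) {
        int subJobs = 0;
        for (String value : hadoopIterable) {
            // Here the "sub job" simply echoes its key/value to the context.
            hadoopContext.write(hadoopKey, value);
            subJobs++;
        }
        return subJobs;
    }

    public static void main(String[] args) {
        FakeContext ctx = new FakeContext();
        int n = extractSubJobs("doc-1", Arrays.asList("a", "b"), ctx);
        System.out.println(n + " sub jobs; context received: " + ctx.written);
    }
}
```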

Configuration

Element     Type  Default  Description
timeout     long  10 min   The timeout to wait for sub jobs to complete.
retryCount  int   1        The number of retries if a job is rejected when attempting to enqueue it. A value from 1 to 5 is allowed.

Branch Configuration

The Hadoop Reducer Subjob Extractor publishes documents using the branch manager, via the events described below. You must therefore include <branches> for these events in the configuration in order to publish to a pipeline within a pipeline manager. See Branch Handler for more details.

Element                           Type    Description
branches/branch/@event            String  The event to configure. Should always be "onSubJob".
branches/branch/@pipelineManager  String  The URL of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline         String  The name of the pipeline to publish to.


Example Configuration

This section provides an example configuration of the Hadoop Reducer Subjob Extractor stage.

<component name="HadoopSubJobs" subType="default" factoryName="aspire-hadoop-subjob-extractor">
  <timeout>60000</timeout>
  <retryCount>4</retryCount>
  <branches>
    <branch event="onSubJob" pipelineManager="../ProcessPipeline" pipeline="doc-pipeline"/>
  </branches>
</component>