The Aspire Hadoop Map Reduce job is a Hadoop MapRed jar configured to run a Mapper, a Combiner, and/or a Reducer, each with an embedded Aspire application that controls its output key/value pairs. The Driver receives an aspireHadoop configuration XML describing the Mapper, Combiner, and/or Reducer. Each section of the XML ("map", "combine", "reduce", or "combine-reduce") should contain a pipeline manager named Main and a main pipeline named doc-process; this is what each Aspire application instance runs as the mapper, combiner, or reducer process.

For information about using the Aspire Hadoop MapRed Driver before Aspire 2.2, see here.


Feature only available with Aspire Enterprise

Configure Aspire Distribution for Aspire Hadoop MapRed Driver


The Aspire Hadoop MapRed Driver requires a special version of the Aspire distribution.

  1. Download the Aspire for Hadoop Distribution.
  2. Unzip the distribution.
  3. Edit the file config/settings.xml to add your Aspire credentials.
  4. Save the file.
  5. Add any external files or components that your distribution needs local access to (i.e., files not retrieved through Maven).
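The credentials in step 3 typically go into the remote repository section of config/settings.xml. The fragment below is only an illustrative sketch; the exact element names and repository URL may differ by Aspire version, so check the settings.xml shipped with your distribution:

```xml
<!-- Illustrative fragment of config/settings.xml; element names may vary by version -->
<repositories>
  <repository type="maven">
    <remoteRepositories>
      <remoteRepository>
        <id>stPublic</id>
        <url>http://repository.searchtechnologies.com/artifactory/public</url>
        <!-- Replace with the Aspire credentials you were issued -->
        <user>your-aspire-username</user>
        <password>your-aspire-password</password>
      </remoteRepository>
    </remoteRepositories>
  </repository>
</repositories>
```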

The driver packages this Aspire for Hadoop distribution and distributes it to all nodes in the cluster for remote execution of the map reduce jobs.

Map Reduce Application XML


The map reduce job is configured through the Map Reduce Application XML. This XML file describes the different map/reduce stages as Aspire application components, namely: map, combine, reduce, or combine-reduce (a combine-reduce stage is executed first as the combine stage and then as the reduce stage). Each application component is expected to have a pipeline manager component named Main and a pipeline named doc-process that controls the main flow of the job for that stage.

The jobs have access to three different variables to interact with Hadoop:

  • hadoopKey: The input key for this stage. It can come either from the input file or from a context write operation in a previous stage. Available in all stages.
  • hadoopContext: Exposes the Hadoop context for write (emit) operations. It can be accessed from Groovy scripts or any other Aspire component called inside the Main pipeline manager. Available in all stages.
  • hadoopIterable: Gives the job access to the list of values associated with the input key. Available only in the combine, reduce, and combine-reduce stages.

Note: All map/reduce stages are optional. If a stage is not specified in the application XML, Hadoop automatically executes an identity function for the map or reduce stage, and nothing for the combine stage.
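For instance, a job that only needs a map stage can omit the combine and reduce sections entirely, and Hadoop will supply the identity reduce. A minimal sketch (the pass-through Groovy script is illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<aspireHadoop>
  <!-- Only a map stage; combine and reduce are omitted -->
  <map name="MapOnlyApplication">
    <components>
      <component factoryName="aspire-application" name="Main" subType="pipeline">
        <components>
          <component factoryName="aspire-groovy" name="Map" subType="default">
            <script>
              <![CDATA[
                // Emit each document unchanged under its input key
                hadoopContext.write(hadoopKey, doc);
              ]]>
            </script>
          </component>
        </components>
        <pipelines>
          <pipeline default="true" name="doc-process">
            <stages>
              <stage component="Map"/>
            </stages>
          </pipeline>
        </pipelines>
      </component>
    </components>
  </map>
</aspireHadoop>
```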

Below is an example of a Map Reduce Application XML with all job stages defined.

<?xml version="1.0" encoding="UTF-8"?>
<aspireHadoop>
  <map name="MapApplication">
    <components>
      <component factoryName="aspire-application" name="Main" subType="pipeline">
        <components>
          <component factoryName="aspire-groovy" name="Map" subType="default">
            <script>
              <![CDATA[
                import com.searchtechnologies.aspire.services.AspireObject;

                doc.content.text().split("\\s+").each() {
                  hadoopContext.write(it, new AspireObject("doc", 1));
                }
              ]]>
            </script>
          </component>
        </components>
        <pipelines>
          <pipeline default="true" name="doc-process">
            <stages>
              <stage component="Map"/>
            </stages>
          </pipeline>
        </pipelines>
      </component>
    </components>
  </map>
  <combine name="CombineApplication">
    <components>
      <component factoryName="aspire-application" name="Main" subType="pipeline">
        <components>
          <component factoryName="aspire-hadoop-emit" name="Emit" subType="default">
            <keyTemplate>{hadoopKey}</keyTemplate>
          </component>
          <component factoryName="aspire-groovy" name="Combine" subType="default">
            <script>
              <![CDATA[
                def wordCount = 0;

                hadoopIterable.each() {
                  wordCount += it.getContent();
                }
                doc.setContent(wordCount);
             ]]>
            </script>
          </component>
        </components>
        <pipelines>
          <pipeline default="true" name="doc-process">
            <stages>
              <stage component="Combine"/>
              <stage component="Emit"/>
            </stages>
          </pipeline>
        </pipelines>
      </component>
    </components>
  </combine>
  <reduce name="ReduceApplication">
    <components>
      <component factoryName="aspire-application" name="Main" subType="pipeline">
        <components>
          <component factoryName="aspire-hadoop-emit" name="Emit" subType="default">
            <keyTemplate>{hadoopKey}</keyTemplate>
          </component>
          <component factoryName="aspire-groovy" name="Reduce" subType="default">
            <script>
              <![CDATA[
                def wordCount = 0;

                hadoopIterable.each() {
                  wordCount += it.getContent();
                }
                doc.setContent(wordCount);
             ]]>
            </script>
          </component>
        </components>
        <pipelines>
          <pipeline default="true" name="doc-process">
            <stages>
              <stage component="Reduce"/>
              <stage component="Emit"/>
            </stages>
          </pipeline>
        </pipelines>
      </component>
    </components>
  </reduce>
</aspireHadoop>
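To make the data flow of the example concrete, here is a plain-Python sketch of what the stages above compute for a word count. It uses no Aspire or Hadoop APIs; the function names and the in-memory "shuffle" are purely illustrative:

```python
# Plain-Python sketch of the word-count map/combine/reduce flow above.
from collections import defaultdict

def map_stage(doc_text):
    # Like the Map script: emit (word, 1) per whitespace-separated token.
    for word in doc_text.split():
        yield word, 1

def reduce_stage(key, values):
    # Like the Combine/Reduce scripts: sum the counts for one key.
    return key, sum(values)

def run_job(docs):
    # Stand-in for Hadoop's shuffle/sort: group emitted values by key.
    grouped = defaultdict(list)
    for doc_text in docs:
        for word, count in map_stage(doc_text):
            grouped[word].append(count)
    return dict(reduce_stage(k, v) for k, v in grouped.items())

print(run_job(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```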

Properties XML

An application XML can be made configurable so that it becomes a general-purpose application for different scenarios. Inside the application XML, parameters are specified with the standard Aspire convention: ${myParameter}.

A list of parameters can then be defined as shown in the example below.

<properties>
  <property name="myParameter">true</property>
  <property name="myParameter2">this is a string value</property>
</properties>

Below is an example of a configurable map application XML together with its properties XML (stored in a separate file).

<?xml version="1.0" encoding="UTF-8"?>
<aspireHadoop>
  <map name="MapApplication">
    <components>
      <component factoryName="aspire-application" name="Main" subType="pipeline">
        <components>
          <component factoryName="aspire-groovy" name="Map" subType="default">
            <variable name="fieldToProcess">return "${fieldToProcess}"</variable>
            <script>
              <![CDATA[
                import com.searchtechnologies.aspire.services.AspireObject;

                doc.get(fieldToProcess).text().split("\\s+").each() {
                  hadoopContext.write(it, new AspireObject("doc", 1));
                }
              ]]>
            </script>
          </component>
        </components>
        <pipelines>
          <pipeline default="true" name="doc-process">
            <stages>
              <stage component="Map"/>
            </stages>
          </pipeline>
        </pipelines>
      </component>
    </components>
  </map>
</aspireHadoop>
<properties> 
  <property name="fieldToProcess">content</property>
</properties>

Parameters


Parameter                      Default  Description
hdfs-input                     None     HDFS directory containing the input data.
hdfs-output                    None     HDFS directory that will contain the output of this component.
aspire-distribution-location   None     Path to the local Aspire distribution directory. The driver will automatically distribute the Aspire distribution across all nodes.
application                    None     Path to the Aspire application XML file with the map/reduce implementation.
properties                     None     Path to the properties XML file associated with the application XML.
retry-load                     True     (Optional) Whether or not to retry loading Aspire.
num-reducers                   32       (Optional) How many reducers to create.

Run Hadoop jar


From a Hadoop client, run the Aspire Hadoop Map Reduce jar:

hadoop jar aspire-hadoop-mapred-2.0.jar com.searchtechnologies.aspire.hadoop.AspireHadoopDriver <hdfs-input> <hdfs-output> <aspire-distribution-location> <application> <properties> <retryLoad> <num-reducers>

Examples: the first runs without a properties file (passing none), the second specifies one.

hadoop jar aspire-hadoop-mapred-2.0.jar com.searchtechnologies.aspire.hadoop.AspireHadoopDriver input output /local-path/to/distro /local-path/to/app.xml none
hadoop jar aspire-hadoop-mapred-2.0.jar com.searchtechnologies.aspire.hadoop.AspireHadoopDriver input output /local-path/to/distro /local-path/to/app.xml /local-path/to/properties.xml

