Aspire for Hadoop provides a collection of Java classes that implement the Hadoop objects needed to connect Aspire and Hadoop, so that big data jobs can be programmed and configured using Aspire pipelines.

Feature only available with Aspire Enterprise


Developing with Aspire for Hadoop


AspireHadoopMapper, AspireHadoopReducer, and AspireHadoopCombiner are Hadoop mapper, reducer, and combiner implementations that launch Aspire application pipelines to run their tasks through. A generic AspireHadoopDriver is also provided as part of the Aspire Hadoop MapRed component: a Hadoop job driver that allows map/reduce jobs to be configured with at least a mapper configuration, with the combiner and reducer optional.

When a new Aspire job is created inside one of these Hadoop tasks (AspireHadoopMapper, AspireHadoopReducer, or AspireHadoopCombiner), the required Hadoop objects (key, values, context, counters) are linked into the Aspire job, so any Aspire component configured in the pipeline can interact with Hadoop.

Any Aspire component can be used to build the Aspire pipelines for your Hadoop jobs.

The list of available components can be found in Interacting with Hadoop from Aspire.

Aside from the provided components, Aspire gives you the ability to create your own components that interact with Hadoop, or to write that interaction in a Groovy script using the aspire-groovy component, as described below.

List of available Aspire classes to interact with Hadoop

Interacting with Hadoop from Aspire Groovy Script

HadoopContext, HadoopIterableWrapper, HadoopConfFactory, and HadoopFSFactory are available for custom coding in Groovy scripts. As the examples below show, instances of these classes are exposed to the script through variables such as hadoopContext and hadoopIterable.

Iterate over Reducer Values

A HadoopIterableWrapper can be iterated with a Groovy closure. Inside the closure, each AspireObject is available through the implicit variable it.

    def count = 0
    // Iterate over the reducer values; "it" is the current AspireObject
    hadoopIterable.each {
      count++
      def url = it.getText("url")   // read a field from the current value
    }
    doc.add("count", count)

Emit from Groovy

To emit key/value pairs directly from a Groovy script, use the HadoopContext write(key, value) method.

    // Emit a key/value pair directly to the Hadoop context
    hadoopContext.write("key", new AspireObject("newDoc"))
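
For example, a script can build a new AspireObject and emit it under a computed key. The following sketch reuses only methods already shown on this page (the AspireObject constructor, add() and getText(), and HadoopContext's write()); the field names are illustrative, and doc is the pipeline's current document, as in the reducer example above.

    // Build a new document and emit it, keyed by the incoming document's URL
    def out = new AspireObject("newDoc")
    out.add("url", doc.getText("url"))       // copy a field from the incoming doc
    out.add("status", "processed")           // illustrative field
    hadoopContext.write(doc.getText("url"), out)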

Creating an Aspire Hadoop Job

Using AspireHadoopDriver

To use the generic Aspire Hadoop MapRed component, create a configuration XML as described in that component's documentation.

AspireHadoopDriver input and output key/value pairs are of type Text/AspireObjectWritable.

Creating a new Aspire-based Hadoop Job

Sometimes you will want to take advantage of running Aspire inside Hadoop for only one of your tasks (map or reduce, but not both). For this, you will need to create a new Hadoop driver. When using AspireHadoopMapper, AspireHadoopReducer, or AspireHadoopCombiner individually, take the following into consideration (a minimal driver sketch follows the list):

  • The input/output key/value pairs are Text/AspireObjectWritable.
  • They all expect a Configuration property called aspire-home with the local path where the aspire-for-hadoop-2.0 folder is located.
  • An Aspire application.xml, as a string, is expected in a Configuration property called ${taskType}-application (where ${taskType} is map, reduce, or combine, depending on the task you are configuring).
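
For illustration, a minimal map-only driver might look like the Groovy sketch below. This is not shipped Aspire code: it assumes Hadoop's org.apache.hadoop.mapreduce API and an input that already supplies Text/AspireObjectWritable pairs; the imports for the Aspire classes depend on your distribution's packages, and the paths and file names are illustrative.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    // Imports for AspireHadoopMapper and AspireObjectWritable are omitted;
    // their package depends on your Aspire distribution.

    def conf = new Configuration()

    // Local path where the aspire-for-hadoop-2.0 folder is located (illustrative)
    conf.set("aspire-home", "/opt/aspire")

    // The Aspire application.xml for the map task, passed as a string
    conf.set("map-application", new File("map-application.xml").text)

    def job = Job.getInstance(conf, "aspire-map-only")
    job.setJarByClass(getClass())

    // Run the Aspire pipeline in the map phase only; no combiner or reducer
    job.setMapperClass(AspireHadoopMapper)
    job.setNumReduceTasks(0)

    // The input must already supply Text/AspireObjectWritable pairs,
    // for example a sequence file written by a previous Aspire job
    job.setInputFormatClass(SequenceFileInputFormat)
    job.setOutputKeyClass(Text)
    job.setOutputValueClass(AspireObjectWritable)

    FileInputFormat.addInputPath(job, new Path(args[0]))
    FileOutputFormat.setOutputPath(job, new Path(args[1]))

    System.exit(job.waitForCompletion(true) ? 0 : 1)

To run Aspire in the reduce phase instead, set the reduce-application property, configure AspireHadoopReducer as the reducer class, and remove the setNumReduceTasks(0) call.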

 
