
Introduction


Application development in Aspire is all about configuring components and then stringing them together into pipelines.

A typical application will contain:

  • One or more feeders

    Feeders create jobs and send them to pipelines

  • One or more pipeline managers

    Pipeline Managers (PMs) receive jobs and process them through pipelines

  • Lots of components

    Components process jobs, doing something specific to the job

  • Some method for outputting the results

    This is a component too, but typically one that writes to some external system, such as a file, relational database, HDFS, or a search engine.

(Diagram: a standard Aspire application)

application.xml

An Aspire application is encapsulated in an XML configuration file. You will find an example of this file in your distribution, in $ASPIRE_HOME/config/application.xml. See Application Configuration for more information.

You can have multiple Application XML files

A single Aspire instance can load multiple application.xml files (with different names). All files can be active at the same time.

Multiple configuration files help make configuration control simpler. For example, you can have a different application.xml for every collection, which makes it easy to add collections of data to your system or remove them.

Application XML files are simply lists of components

Basically, an application.xml file holds a simple list of components. Each component will have configuration data which is used to initialize the component and control what it does.

Some components are feeders that produce jobs. And of course, some components are pipeline managers, which themselves have nested components.

Application XML files can be stored in and downloaded from Maven repositories

When this happens, the application is called an "App Bundle."

App Bundles are a convenient method for sharing applications across many Aspire instances, spread across a network or spread across the world.

Typical layout of an Application

An application, as specified in an application.xml file, typically has the following structure:

<application name="MyAppName">
  <components>
    <!-- Feeder Components -->
    <component> . . . </component>
    <component> . . . </component>
    .
    .
    .

    <!-- Pipeline Managers -->
    <component name="..." subType="pipeline" factoryName="aspire-application">
      <pipelines>
        <pipeline>
          <stage .../>
          <stage .../>
          <stage .../>
          .
          .
          .
        </pipeline>
      </pipelines>

      <components>
        <!-- Components used by and nested within the pipeline manager -->
        <component> . . . </component>
        <component> . . . </component>
        .
        .
        .
      </components>
    </component>


    <!-- More pipeline managers (or other components) usually go here -->
    .
    .
    .
  </components>
</application>

The Pipeline Manager


As you might expect, the Pipeline Manager (PM) plays a pivotal role in Aspire.

Pipelines are sequences of components. PMs receive jobs and process them through pipelines. There are two methods by which a PM can process a job:

  • process() - Synchronous job processing

    When a PM "processes" a job, it means that the job is processed immediately. In this situation, it is the thread that calls the process() method, which is used to actually carry the job through each of the components in the pipeline.

  • enqueue() - Asynchronous job processing

    Jobs that are enqueue()'d are placed on an input queue to be processed by the PM at a future time (or right away, if a thread is available).

Of these two methods, enqueue() is used the most. To manage enqueue(), the PM maintains two structures: the job queue and the thread pool.

The Job Queue

Every PM maintains an input queue of jobs. This queue has a maximum size, which can be set with the <queueSize> parameter, e.g., <queueSize>30</queueSize>.

If the queue is full, the feeder that is submitting the job will be blocked. If the queue remains full after a timeout period, an exception will be thrown back to the feeder.
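
For example, the queue limit is set inside the pipeline manager's configuration. The surrounding component definition below is a generic sketch; only <queueSize> itself is described on this page.

<component name="MyPipelineManager" subType="pipeline" factoryName="aspire-application">
  <!-- At most 30 jobs may wait on this PM's input queue; a feeder that
       submits to a full queue blocks, and eventually receives an exception -->
  <queueSize>30</queueSize>
  . . .
</component>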

The Thread Pool

Every PM maintains a thread pool. Threads will be created as necessary to process jobs on the Job Queue, and then will be shut down if they are idle for a timeout period.

The PM specifies a maximum number of threads. This maximum can be set with the <maxThreads> parameter.

The maximum number of threads can also be dynamically adjusted on the user interface.
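
Continuing the sketch above (again, only <maxThreads> itself comes from this page), the thread pool limit sits alongside the queue size in the PM's configuration:

<component name="MyPipelineManager" subType="pipeline" factoryName="aspire-application">
  <queueSize>30</queueSize>
  <!-- Up to 10 threads may be created to process jobs from the queue;
       threads idle past the timeout period are shut down -->
  <maxThreads>10</maxThreads>
  . . .
</component>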

Job Branching and Routing

There are three different ways in which a job can move around the system.

  1. "Normal Flow" - From pipeline stage to pipeline stage

    A pipeline is a sequence of stages managed by the PM. Once a job is submitted to the pipeline, the PM will automatically send the job to each stage in turn.

  2. "Branching" - From one pipeline to another

    Jobs can be branched from one pipeline to another with "branches." Branches are all handled by the Branch Handler, which specifies the destination PM and the pipeline (within that pipeline manager) to which the job will be branched. Pipeline branching occurs when a job hits some event (such as "onComplete" or "onError"); these branches are defined in the <branches> tag of the PM (see the sketch after this list). Sub-job branching occurs when sub-jobs are created and branched to pipelines; these branches are defined as part of the Sub Job Extractor component.

  3. "Routing" - Dynamic routes attached to a job

    Routing tables can be dynamically generated and attached to jobs. This is unlike branching, which is specified in the XML file. Routing also occurs at a higher level than branching: once a job is routed to a PM, the PM takes over and is in full control of the job, which may be branched around using the Branch Handler any number of times. Only once the job is completely done with the PM, i.e., when it is "complete", is it routed to the next PM in the routing table.
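
As an illustration of pipeline branching, a branch definition inside a PM might look like the following sketch. Only the "onComplete" and "onError" events are described above; the <branches>/<branch> element names, their attributes, and the destination names are assumptions used for illustration.

<component name="MyPipelineManager" subType="pipeline" factoryName="aspire-application">
  <branches>
    <!-- When a job finishes this PM's pipeline, hand it to another PM -->
    <branch event="onComplete" pipelineManager="/MyAppName/NextPM" pipeline="next-pipeline"/>
    <!-- Jobs that fail are branched to an error-handling pipeline -->
    <branch event="onError" pipelineManager="/MyAppName/ErrorPM" pipeline="error-pipeline"/>
  </branches>
  . . .
</component>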

Parent Jobs and Sub-jobs


Perhaps the most powerful aspect of Aspire is its ability to create sub-jobs from parent jobs. Once one understands how this works, it opens up endless possibilities.

Let's start with a few examples.

Example 1: Processing a directory of files

JOB: Use Feed One to initiate the job.

  • Parent job holds the path of the directory to scan.
  • Scan Directory is used to create sub-jobs for each file in the directory.

SUB-JOB: One sub-job for every file.

  • Fetch URL to fetch the content of the document.
  • Extract Text to extract text from each document.
  • Post HTTP to send the document and its text to the search engine.
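
A rough application.xml sketch of this example is shown below. The component names, factory names, and the "onSubJob" event are assumptions used for illustration (check the documentation for each component for the real values), and for brevity everything is collapsed into a single PM rather than the separate PMs recommended under Multiple Pipeline Managers below.

<component name="DocProcessor" subType="pipeline" factoryName="aspire-application">
  <pipelines>
    <!-- Parent job: scan the directory and spawn one sub-job per file -->
    <pipeline name="scan-pipeline">
      <stage component="ScanDirectory"/>
    </pipeline>
    <!-- Sub-jobs: fetch, extract, and post each file -->
    <pipeline name="process-pipeline">
      <stage component="FetchUrl"/>
      <stage component="ExtractText"/>
      <stage component="PostHTTP"/>
    </pipeline>
  </pipelines>

  <components>
    <component name="ScanDirectory" subType="default" factoryName="aspire-scan-dir">
      <!-- Every file found becomes a sub-job branched to the processing pipeline -->
      <branches>
        <branch event="onSubJob" pipelineManager="/MyAppName/DocProcessor" pipeline="process-pipeline"/>
      </branches>
    </component>
    <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url"/>
    <component name="ExtractText" subType="default" factoryName="aspire-extract-text"/>
    <component name="PostHTTP" subType="default" factoryName="aspire-post-http"/>
  </components>
</component>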


Example 2: Processing Wikipedia Dump Files

See the Wikipedia blog entry for a complete description.

JOB: Use Feed One to initiate the job.

SUB-JOB: One sub-job for every BZIP2-compressed XML file.

  • Fetch URL to open a stream on the BZIP2-compressed XML file.
  • BZip2 Decompress Stream (Aspire 2) to decompress the file.
  • XML Sub Job Extractor (Aspire 2) to extract each Wikipedia page from the larger XML dump file. Each Wikipedia page is spawned as a sub-job.

SUB-SUB-JOB: Processes each Wikipedia page.

  • Groovy Scripting to terminate pages that are redirects, to extract categories, to identify disambiguation pages, and to clean up the static document teasers.
  • Post HTTP (Aspire 2) to send the document and its text to the search engine.


Example 3: Processing 70 Million Database Records

Processing large numbers of records must be done in batches; otherwise, the database may be locked for long periods of time, preventing anyone else from using it.

JOB: Use Feed One to initiate the job.

  • Groovy Scripting to create 10,000 batches, each with an ID from 0 to 9,999. Each batch is submitted as a sub-job.

SUB-JOB: One sub-job for each batch of records.

  • RDB Sub Job Feeder (Aspire 2) to select all records for the specified batch. This is done by looking for all records where the [record ID] modulo [10,000] is equal to the [batch ID]. The RDB Sub Job Feeder submits each individual record as a sub-sub-job.

SUB-SUB-JOB: Processes each individual record.

  • Post HTTP to send the document and its text to the search engine.
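
A hedged sketch of the batch selection follows. The element name <sql>, the {batchId} substitution syntax, the factory name, and the table and column names are all placeholders for illustration; only the modulo logic itself comes from the description above.

<component name="BatchFeeder" subType="default" factoryName="aspire-rdb-subjob-feeder">
  <!-- {batchId} (0 to 9,999) is substituted from the parent job's metadata;
       a record belongs to this batch when its ID modulo 10,000 equals the batch ID -->
  <sql>SELECT id, title, body FROM documents WHERE MOD(id, 10000) = {batchId}</sql>
</component>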

Multiple Pipeline Managers

Best practice in Aspire is to create a separate Pipeline Manager (PM) every time you create sub-jobs from a parent job. For example, in the Wikipedia example above, there would be three PMs:

  • One to handle the parent job (initiate the index run)
  • One to handle each BZIP2 file
  • One to handle each individual Wikipedia page.

Why so many Pipeline Managers? Why not just one to do everything?

The issue is thread starvation. Suppose you had just a single pool of (say) 10 threads in a single PM. If you are processing more than 10 BZIP2 files, all of the threads may be used up processing those files. This would leave no threads to process the actual Wikipedia pages: the parent jobs hold every thread while their sub-jobs wait for one, and the system would grind to a halt.

Using separate PMs for each level of job neatly avoids this issue. Since each PM has its own thread pool, there can never be a situation where parent jobs use up all of the threads, leaving nothing left over for the sub-jobs. Thread pools for different levels of jobs are kept separate, which assures high performance even with very complex job structures.

Sub Job Extractors

The common denominator in all of the examples above is that they all contain "sub job extractors". These are components which divide up a larger job into smaller pieces, and then spawn off separate sub-jobs for each piece.

Every Sub Job Extractor will be configured with a Branch Handler, which specifies where the sub-jobs should be sent after they have been created. Note that the Branch Handler can also branch jobs to remote servers and can also combine jobs into batches.
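
For example, the branch handler of a sub job extractor might be configured as in this sketch (the "onSubJob" event name, the factory name, and the destination names are assumptions used for illustration):

<component name="SplitXml" subType="default" factoryName="aspire-xml-subjob-extractor">
  <branches>
    <!-- Every sub-job created by this extractor is sent to the named PM and pipeline -->
    <branch event="onSubJob" pipelineManager="/MyAppName/PageProcessor" pipeline="process-page"/>
  </branches>
</component>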

Some of the more useful Sub Job Extractors include:

  • XML Sub Job Extractor - Assumes InputStream is XML and splits it into multiple sub-jobs. Every tag underneath the root tag becomes a separate sub-job.
  • Tabular Files Extractor - Assumes InputStream is a CSV or tab-delimited file. Every row of the file becomes a new sub-job.
  • RDB Sub Job Feeder - Executes a SQL select statement (which can have substitutable parameters that are filled in with job metadata) and submits all of the selected records as separate sub-jobs.
  • Scan Directory - Scans through a directory and submits all of the files as separate sub-jobs. Can also do recursive directory scans including sub-folders.
  • Groovy Scripting - Can be used to do any sort of loop which creates sub-jobs and branches them.

All of the scanners (see below) are also, technically, Sub Job Extractors.
