Application development in Aspire is all about configuring components and then stringing them together into pipelines.
A typical application will contain:
- Feeders, which create jobs and send them to pipelines.
- Pipeline Managers (PMs), which receive jobs and process them through pipelines.
- Components, which process jobs, each doing something specific to the job.
- Output components (these are components too), which typically write to some external target, such as a file, relational database, HDFS, or search engine.
An Aspire application is encapsulated in an XML configuration file. You will find an example of this file in your distribution, in $ASPIRE_HOME/config/application.xml. See Application Configuration for more information.
You can have multiple Application XML files
A single Aspire instance can load multiple application.xml files (with different names). All files can be active at the same time.
Using multiple configuration files helps keep configuration control simple. For example, you can have a different application.xml for every collection, which makes it easy to add or remove collections of data in your system.
Application XML files are simply lists of components
Basically, an application.xml file holds a simple list of components. Each component will have configuration data which is used to initialize the component and control what it does.
Some components are feeders that produce jobs. And of course, some components are pipeline managers, which themselves have nested components.
Application XML files can be stored and downloaded from Maven repositories
When this happens, the application is called an "App Bundle."
App Bundles are a convenient method for sharing applications across many Aspire instances, spread across a network or spread across the world.
An application, as specified in an application.xml file, typically has the following structure:
<application name="MyAppName">
  <components>
    <!-- Feeder components -->
    <component> . . . </component>
    <component> . . . </component>
    . . .

    <!-- Pipeline managers -->
    <component name="..." subType="pipeline" factoryName="aspire-application">
      <pipelines>
        <pipeline>
          <stage .../>
          <stage .../>
          <stage .../>
          . . .
        </pipeline>
      </pipelines>
      <components>
        <!-- Components used by and nested within the pipeline manager -->
        <component> . . . </component>
        <component> . . . </component>
        . . .
      </components>
    </component>

    <!-- More pipeline managers (or other components) usually go here -->
    . . .
  </components>
</application>
As you might expect, the Pipeline Manager (PM) plays a pivotal role in Aspire.
Pipelines are sequences of components. PMs receive jobs and process them through pipelines. There are two methods by which a PM can process a job: process() and enqueue().
When a PM "processes" a job, the job is processed immediately: the thread that calls the process() method is the thread that carries the job through each of the components in the pipeline.
Jobs that are enqueue()'d are placed on an input queue to be processed by the PM at a future time (or right away, if a thread is available).
Of these two methods, enqueue() is used the most. To manage enqueue(), the PM maintains two structures: the job queue and the thread pool.
Every PM maintains an input queue of jobs. This queue has an upper size limit, which can be set with the <queueSize> parameter, e.g., <queueSize>30</queueSize>.
If the queue is full, the feeder submitting the job will be blocked. If the queue remains full after a timeout period, an exception is thrown to the feeder.
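This queue-full behavior can be sketched with a bounded blocking queue in plain Java (a simplified illustration of the idea, not Aspire's actual implementation):

```java
import java.util.concurrent.*;

public class QueueDemo {
    // Like a feeder submitting to a full PM queue: block up to the timeout,
    // then report failure (Aspire surfaces this as an exception to the feeder).
    public static boolean tryEnqueue(BlockingQueue<String> queue, String job, long timeoutMs)
            throws InterruptedException {
        return queue.offer(job, timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2); // cf. <queueSize>2</queueSize>
        System.out.println(tryEnqueue(queue, "job-1", 10)); // true
        System.out.println(tryEnqueue(queue, "job-2", 10)); // true
        System.out.println(tryEnqueue(queue, "job-3", 10)); // false: queue stayed full past the timeout
    }
}
```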
Every PM maintains a thread pool. Threads will be created as necessary to process jobs on the Job Queue, and then will be shut down if they are idle for a timeout period.
The PM specifies a maximum number of threads. This maximum can be set with the <maxThreads> parameter.
The maximum number of threads can also be dynamically adjusted on the user interface.
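For example, a pipeline manager's queue size and thread limit might be configured as in the following fragment (illustrative only; check the Pipeline Manager documentation for the exact placement of these parameters):

```xml
<component name="MyPipelineManager" subType="pipeline" factoryName="aspire-application">
  <queueSize>30</queueSize>
  <maxThreads>10</maxThreads>
  <pipelines>
    . . .
  </pipelines>
</component>
```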
There are three different ways in which a job can move around the system.
A pipeline is a sequence of stages managed by the PM. Once a job is submitted to the pipeline, the PM will automatically send the job to each stage in turn.
Jobs can be branched from one pipeline to another with "branches." Branches are handled by the Branch Handler, which specifies the destination PM and the pipeline (within that pipeline manager) to which the job will be branched.
Pipeline branching occurs when a job raises some event (such as "onComplete" or "onError"). These branches are defined in the <pipeline> tag of the PM.
Sub-job branching occurs when sub-jobs are created and branched to pipelines. These branches are defined as part of the Sub Job Extractor component.
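A pipeline branch might look like the following fragment (a sketch of the shape only; the event names are taken from the text above, but the attribute spellings are assumptions, not copied from a working configuration):

```xml
<pipeline name="doc-process" default="true">
  <stage .../>
  . . .
  <branches>
    <branch event="onComplete" pipelineManager="../IndexPM" pipeline="index-pipeline"/>
    <branch event="onError" pipelineManager="../ErrorPM" pipeline="error-pipeline"/>
  </branches>
</pipeline>
```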
Routing tables can be dynamically generated and attached to jobs. This is unlike branching, which is specified in the XML file. Routing also occurs at a higher level than branching: once a job is routed to a PM, the PM takes over and is in full control of the job, which may be branched around by the Branch Handler any number of times. Only when the job is completely done with the PM, i.e., when it is "complete," is it routed to the next PM in the routing table.
Perhaps the most powerful aspect of Aspire is its ability to create sub-jobs from parent jobs. Once one understands how this works, it opens up endless possibilities.
Let's start with a few examples.
Example 1: Processing a directory of files
JOB: Use Feed One to initiate the job.
SUB-JOB: One sub-job for every file:
- Fetch URL to fetch the content of the document.
- Extract Text to extract text from each document.
- Post HTTP to send the document and its text to the search engine.
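Following the application.xml skeleton shown earlier, the sub-job pipeline for this example might be sketched as (the component names here are placeholders for illustration):

```xml
<pipeline name="process-file">
  <stage component="FetchUrl"/>
  <stage component="ExtractText"/>
  <stage component="PostHTTP"/>
</pipeline>
```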
Example 2: Processing Wikipedia Dump Files
See the Wikipedia blog entry for a complete description.
JOB: Use Feed One to initiate the job.
SUB-JOB: One sub-job for every BZIP2-compressed XML file:
- Fetch URL to open a stream on the BZIP2-compressed XML file.
- BZip2 Decompress Stream (Aspire 2) to decompress the file.
- XML Sub Job Extractor to extract each Wikipedia page from the larger XML dump file. Each Wikipedia page is spawned as a sub-job.
SUB-SUB-JOB: Processes each Wikipedia page:
- Groovy Scripting to terminate pages which are redirects, to extract categories, to identify disambiguation pages, and to clean up the static document teasers.
- Post HTTP to send the document and its text to the search engine.
Example 3: Processing 70 Million Database Records
Processing large numbers of records must be done in batches; otherwise, the database may be locked for long periods of time, preventing anyone else from using it.
JOB: Use Feed One to initiate the job.
SUB-JOB: One sub-job for each batch of records:
- RDB Sub Job Feeder (Aspire 2) to select all records for the specified batch. This is done by looking for all records where the [record ID] modulo [10,000] equals the [batch ID]. The RDB Sub Job Feeder submits each individual record as a sub-sub-job.
SUB-SUB-JOB: Processes each individual record:
- Post HTTP to send the document and its text to the search engine.
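The modulo batching used by the sub-job step can be sketched in plain Java (a hypothetical illustration of the record-selection logic, not the RDB Sub Job Feeder itself):

```java
import java.util.*;

public class BatchDemo {
    // Assign each record ID to a batch: batch ID = record ID modulo batchCount.
    // Selecting "all records where id % 10000 == batchId" yields exactly one batch,
    // so 70 million records split into roughly 10,000 batches of ~7,000 records each.
    public static Map<Integer, List<Integer>> partition(List<Integer> recordIds, int batchCount) {
        Map<Integer, List<Integer>> batches = new HashMap<>();
        for (int id : recordIds) {
            batches.computeIfAbsent(id % batchCount, k -> new ArrayList<>()).add(id);
        }
        return batches;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> b = partition(Arrays.asList(3, 7, 10003, 20003), 10000);
        System.out.println(b.get(3)); // [3, 10003, 20003]
    }
}
```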
Best practice in Aspire is to create a separate Pipeline Manager (PM) every time you create sub-jobs from a parent job. For example, in the Wikipedia example above, there would be three PMs: one for the top-level job, one for the per-file sub-jobs, and one for the per-page sub-sub-jobs.
Why so many Pipeline Managers? Why not just one to do everything?
The issue is thread starvation. Suppose you had just a single pool of (say) 10 threads in a single PM. If you are processing more than 10 BZIP2 files, all of the threads may be used up processing those files. This would leave no threads left over to process the actual Wikipedia pages, and the system would grind to a halt.
Using separate PMs for each level of job neatly avoids this issue. Since each PM has its own thread pool, there can never be a situation where parent jobs use up all of the threads, leaving nothing for the sub-jobs. Thread pools for different levels of jobs are kept separate, which assures high performance even with very complex job structures.
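The starvation argument can be demonstrated with two plain Java thread pools standing in for two PMs (a simplified sketch, not Aspire's actual scheduling code):

```java
import java.util.*;
import java.util.concurrent.*;

public class StarvationDemo {
    // Each parent job spawns a sub-job and waits for its result.
    // If parents and sub-jobs shared one small pool, parents could occupy
    // every thread while their sub-jobs sat queued behind them: a deadlock.
    // Giving each job level its own pool (one PM per level) avoids this.
    static int runWithSeparatePools(int parents) throws Exception {
        ExecutorService parentPool = Executors.newFixedThreadPool(2); // "parent PM"
        ExecutorService childPool = Executors.newFixedThreadPool(2);  // "sub-job PM"
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < parents; i++) {
                final int id = i;
                results.add(parentPool.submit(() -> {
                    // the sub-job runs in its own pool, so this wait cannot starve it
                    Future<Integer> sub = childPool.submit(() -> id * 10);
                    return sub.get();
                }));
            }
            int sum = 0;
            for (Future<Integer> f : results) sum += f.get();
            return sum;
        } finally {
            parentPool.shutdown();
            childPool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runWithSeparatePools(4)); // 0 + 10 + 20 + 30 = 60
    }
}
```

If both levels submitted to the same two-thread pool, the two running parents would wait forever on sub-jobs that can never be scheduled.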
The common denominator in all of the examples above is that they all contain "sub job extractors". These are components which divide up a larger job into smaller pieces, and then spawn off separate sub-jobs for each piece.
Every Sub Job Extractor will be configured with a Branch Handler, which specifies where the sub-jobs should be sent after they have been created. Note that the Branch Handler can also branch jobs to remote servers and can also combine jobs into batches.
Some of the more useful Sub Job Extractors include:
All of the scanners (see below) are, technically, Sub Job Extractors as well.
Feeders generate jobs. All Aspire applications will have feeders of some sort or another.
There are a few types of feeders:
Recommendation: Feed One is always a good place to start, even if you only end up using it for debugging.
The Scheduler has a stored list of jobs, and submits them to a processing pipeline following a given schedule.
Scanners are like feeders in that they scan through external servers and create jobs, but they operate at a higher level:
Pull feeders poll external databases for things to process, and then pull those items into the system and submit them to pipelines as jobs.
Pull feeders include any of the following components:
Push feeders are passive. They respond to outside events from external agents who push new jobs to Aspire.
The Scheduler also generates jobs like a feeder, but it is not based on an external event. Instead, it loads a processing schedule and submits jobs at particular intervals or particular times of the day or week.
Although similar to a feeder, the Scheduler is a special type of component. It is installed apart from any pipelines and “lives” from when Aspire starts up until system shutdown. Normally you use it together with scanners to schedule periodic scans of a repository.
Scanners typically receive jobs from the Scheduler; these jobs include all of the information needed to specify the details of the content source (server, username, password, directory, etc.). Scanners can be used as components in your own application as long as you provide all of the necessary information in the job that you send to the scanner.
Scanners available include:
Every pipeline Stage is an Aspire Component. The only difference is that a stage implements a process(job) method, which the Pipeline Manager calls to process jobs.
The most useful pipeline stage is the Groovy Scripting stage. This stage can perform just about any metadata manipulation or I/O function that you might require. Very often, Aspire components start off as Groovy scripting stages and then are migrated to full Java components.
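As a sketch, a Groovy stage embedded in application.xml might look like the following (the factory name, script placement, and document API used here are assumptions for illustration; consult the Groovy Scripting stage documentation for the real interface):

```xml
<component name="CleanTitle" subType="default" factoryName="aspire-groovy">
  <config>
    <script>
      <![CDATA[
        // hypothetical metadata fix-up: trim whitespace from the title field
        def title = doc.title?.text()
        if (title != null) {
          doc.title.setText(title.trim())
        }
      ]]>
    </script>
  </config>
</component>
```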
Some processors open up input streams to external content. These input streams are actual Java InputStream objects from which bytes can be read.
Once the InputStream is open, a later stage can read data from the input stream and process it.
Once an InputStream is open, you can do a number of different things with it.
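The wrap-and-read pattern can be illustrated with standard Java streams (using GZIP from the JDK as a stand-in for a decompression stage such as BZip2 Decompress Stream):

```java
import java.io.*;
import java.util.zip.*;

public class StreamDemo {
    // Compress some text in memory, then do what a downstream stage would do:
    // wrap the raw InputStream in a decompressing stream and read it to the end.
    public static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes("UTF-8"));
        }
        // a later stage reads from the (wrapped) InputStream opened earlier
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(buf.toByteArray()));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello")); // hello
    }
}
```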
It is expected that more compression / decompression components will be created in the future.
Sub Job Extractors That Read InputStreams
By far the most important content processor is the Groovy Stage, which can do just about anything, based on a script which you write directly into the Application XML file.
Other content processors include:
Some components do not process jobs, but instead provide a service, which is used by other components.
Sometimes the service is simply to make certain Java classes available to components that need them. This is primarily the case for the aspire-lucene component, for example. Services in this category include:
And in other cases, the service may be to create a pool of resources that can be drawn upon as needed. This is the case for the RDBMS Connection component, which maintains a pool of open relational database connections. Services in this category include: