Pipelines

Most pipeline configurations are a simple list of stages, for example:

<pipelines>
  <pipeline name="doc-process" default="true">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" />
      <stage component="dateChooser" />
      <stage component="extractDomain" />
      <stage component="printToFile" />
      <stage component="feed2Solr" />
    </stages>
  </pipeline>
</pipelines>

Enabling and Disabling Pipelines and Stages

As with components, pipelines and stage references can be disabled using the @enable or @disable attributes. If both @enable and @disable are specified, the value of @enable takes precedence. Disabled pipelines are completely removed from the system, as if they had never been written in the XML file. Disabling a stage reference removes that reference from the pipeline but does not alter the component definition for the stage.

These flags are useful for turning pipelines and stage references on or off in response to property settings (supplied either through an App Bundle or through properties specified in the settings.xml file).

Example:

<pipelines>
  
  <pipeline name="doc-process1" enable="false">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" />
      <stage component="dateChooser" />
      <stage component="extractDomain" />
      <stage component="printToFile" />
      <stage component="feed2Solr" />
    </stages>
  </pipeline>
  <pipeline name="doc-process2" disable="true">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" />
      <stage component="dateChooser" />
      <stage component="extractDomain" />
      <stage component="printToFile" />
      <stage component="feed2Solr" />
    </stages>
  </pipeline>

  
  <pipeline name="doc-process3" enable="true">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" enable="false" />
      <stage component="dateChooser" disable="true" />
      <stage component="extractDomain" enable="false" />
      <stage component="printToFile" />
      <stage component="feed2Solr" />
    </stages>
  </pipeline>
</pipelines>

If neither @enable nor @disable is present, the pipeline or stage is assumed to be enabled.
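
The default and precedence rules can be seen side by side in a small sketch. The component names are reused from the example above; the pipeline name doc-process4, the ${...} placeholder syntax, and the property name enableSolrFeed are assumptions for illustration, so check how your installation substitutes properties before relying on the last line.

```xml
<pipeline name="doc-process4">
  <stages>
    <!-- No flag: the stage reference is enabled by default -->
    <stage component="fetchUrl" />
    <!-- Both flags present: @enable wins, so this reference stays enabled -->
    <stage component="extractText" enable="true" disable="true" />
    <!-- Both flags present: @enable wins, so this reference is removed -->
    <stage component="splitter" enable="false" disable="false" />
    <!-- Assumed placeholder syntax; enableSolrFeed is a hypothetical property -->
    <stage component="feed2Solr" enable="${enableSolrFeed}" />
  </stages>
</pipeline>
```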

Pipeline Configuration

pipeline/@name	The name of the pipeline. Can be used to branch from one pipeline to another (see Pipeline Branches below).
pipeline/@default	"true" if the pipeline is the default pipeline for the pipeline manager. Jobs sent to the pipeline manager are automatically sent to the default pipeline unless another pipeline is specified by name.
pipeline/@enable	"true" if the pipeline should be enabled.
pipeline/@disable	"true" if the pipeline should be disabled.
pipeline/stages/stage	The list of stages which make up the pipeline. Each pipeline is a single linear list of stages.
pipeline/stages/stage/@component	The name of the component which will serve as the pipeline stage. Note that all pipeline stages are also Aspire components (the reverse is not true).
pipeline/stages/stage/@enable	"true" if the stage should be enabled.
pipeline/stages/stage/@disable	"true" if the stage should be disabled.

Typically, @component values are "local" references, i.e., references to components defined within the same pipeline manager. However, it is perfectly fine to use absolute path names, such as /Common/OtherPipelineManager/OtherStage, or relative paths, such as ../OtherPipelineManager/OtherStage, as the component attribute. In this way you can share components across pipeline manager configurations.

Note, however, that sharing components in this way is rarely required. Only do this if the component holds some large resource (such as a dictionary loaded into RAM) that needs to be shared to conserve memory.
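
A sketch of the three reference styles follows. The two paths are the ones quoted above; the pipeline name is hypothetical, and the example assumes that a component named OtherStage actually exists in the other pipeline manager.

```xml
<pipeline name="shared-stages-example">
  <stages>
    <!-- Local reference: component defined in this pipeline manager -->
    <stage component="fetchUrl" />
    <!-- Absolute path to a component in another pipeline manager -->
    <stage component="/Common/OtherPipelineManager/OtherStage" />
    <!-- Relative path to the same component -->
    <stage component="../OtherPipelineManager/OtherStage" />
  </stages>
</pipeline>
```

In practice you would pick one of the two path forms; both are shown here only for comparison.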

Pipeline Branches

Pipelines can also be configured with branches, which determine what happens to a job/document when certain events occur. Branches are configured inside the pipeline using a <branches> tag, as shown below:

<pipelines>
  <pipeline name="doc-process" default="true">
    <stages>
      <stage component="FetchUrl" />
      <stage component="ExtractText" />
      <stage component="Splitter" />
      <stage component="DateChooser" />
      <stage component="ExtractDomain" />
      <stage component="PrintToFile" />
      <stage component="Feed2Solr" />
    </stages>
    <branches>
      <branch event="onError" pipeline="error-pipeline" />
      <branch event="onComplete" pipelineManager="SomeOtherPipemgr" pipeline="some-other-pipeline" />
      <branch event="onMyEvent" pipelineManager="SomeOtherPipemgr" pipeline="some-other-pipeline" stage="some-stage"/>
    </branches>
  </pipeline>

  <pipeline name="error-pipeline">
    
    .
    .
    .
  </pipeline>
</pipelines>

If @pipelineManager is not specified, the event will branch within the same pipeline manager. If @pipeline is not specified, the event will branch to the same pipeline on this pipeline manager (when @pipelineManager is not given) or to the default pipeline on the specified pipeline manager (when it is). If @stage is specified, processing of the job will continue with that stage (which could be in the middle of the pipeline), on the pipeline manager and pipeline determined by the above rules.
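
As an illustration of these rules, the sketch below lists one branch per attribute combination. The event names onEventA through onEventD, the pipeline manager OtherPipemgr, and the pipeline and stage names are all hypothetical.

```xml
<branches>
  <!-- @pipeline only: "error-pipeline" on the same pipeline manager -->
  <branch event="onEventA" pipeline="error-pipeline" />
  <!-- @pipelineManager only: the default pipeline on OtherPipemgr -->
  <branch event="onEventB" pipelineManager="OtherPipemgr" />
  <!-- Both given: "other-pipeline" on OtherPipemgr -->
  <branch event="onEventC" pipelineManager="OtherPipemgr" pipeline="other-pipeline" />
  <!-- @stage only: same pipeline manager and pipeline, continuing at "Splitter" -->
  <branch event="onEventD" stage="Splitter" />
</branches>
```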

There are three built-in events which can be triggered for a job which is being processed by the pipeline:

onError	If any exception is thrown by a pipeline stage while processing a job, the pipeline manager will look for an "onError" branch and, if it exists, will route the job to the specified destination.
onComplete	When the job has completed a pipeline, the pipeline manager will look for an "onComplete" branch. If it exists, the job will be routed to the specified destination.
onTerminate	If a job is terminated by a pipeline stage (note: this is different from an exception, see below), the pipeline manager will check for an "onTerminate" branch and, if found, will route the terminated job to the specified destination. Once the job is routed, it is no longer considered "terminated" and continues processing as before.

Components may also raise other events of their own.

Terminating Jobs

There are many cases where a job will need to be terminated. Note that "termination" is not the same as "exception" or "error". Jobs that are terminated are still considered to be "successful". Basically, termination means that the job (or sub-job) just skips the rest of the pipeline.

Termination is useful for any situation where you do not want to further process a job, i.e., it allows stages to "filter out" jobs from the pipeline. This is typically used for documents that contain some sort of expected situation that indicates the job should not be indexed, e.g., if the document is a duplicate of some other document, or maybe it doesn't contain enough domain keywords, or perhaps it was used as a starting point by the crawler but it is not desired to index the document itself.

Termination is implemented with a "terminate()" method. When called, terminate() sets a termination flag which is checked by the pipeline manager as soon as the current stage is complete. Jobs with the flag set will skip all remaining stages of the pipeline. Note that jobs also have setTerminate(flag) and getTerminate() methods, so you can check, set, or clear the flag as often as you like. These methods can be used both in stages and in Groovy scripts.

Also note that pipelines can have branches, and that the pipeline manager supports an optional "onTerminate" event.
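
A minimal sketch of a filtering stage written as a Groovy script is shown below. The job.terminate() call is the method described above; the component wrapper (factoryName="aspire-groovy"), the doc variable, the getText() accessor, and the isDuplicate field are assumptions made for illustration and should be checked against the Groovy stage documentation for your Aspire version.

```xml
<component name="DropDuplicates" subType="default" factoryName="aspire-groovy">
  <script>
    <![CDATA[
      // Assumption: "doc" and "job" are exposed to the script by the Groovy stage
      if (doc.getText("isDuplicate") == "true") {
        // Sets the termination flag; the remaining stages are skipped
        // once this stage completes
        job.terminate();
      }
    ]]>
  </script>
</component>
```

Because the flag is only checked after the current stage finishes, any code that follows terminate() in the same script still runs.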

For example:

<pipeline name="test2">
  <stages>
    <stage component="Schwarzenegger"/>
    <stage component="OldFashionedSgml"/>
  </stages>
  <branches>
    <branch event="onTerminate" pipeline="process-terminate"/>
  </branches>
</pipeline>
<pipeline name="process-terminate">
  <stages>
    <stage component="NewFangledXml"/>
    <stage component="AndAnother"/>
  </stages>
</pipeline>

In the above example, the "Schwarzenegger" stage causes the job to be terminated (Arnold is the Terminator, right?). This is trapped by the pipeline's "onTerminate" branch, which then sends the job to the "process-terminate" pipeline where it continues.

Again, note that having the branch is purely optional. If it doesn't exist, the job will simply skip all remaining stages in the pipeline and then exit.
