Groovy pipelines are pipelines in which the flow of jobs through stages is controlled by a Groovy script instead of a fixed list of stages. For example:
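A minimal pipeline script might look like this (a sketch; FetchUrl, ExtractText, and Feed2Solr are assumed to be stage names configured in the Pipeline Manager):

```groovy
// pass the job through three configured stages, in order
job | FetchUrl | ExtractText | Feed2Solr
```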
job | Java Type = Job | References the job currently being processed by this Groovy pipeline. Use this variable to pass the job through stages. |
doc | Java Type = AspireObject | The AspireObject holding all of the metadata for the document currently being processed by this Groovy pipeline. This is the same as job.get() or job.doc (the job's data object). |
StageName | | Every stage component configured in the current Pipeline Manager is bound by its name to this Groovy pipeline, so you can reference stages by their configured names (e.g. job | FetchUrl | ExtractText). |
If you want to reference a stage configured outside the current Pipeline Manager, you can reference it by the path to that stage component:
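A sketch; the path string below is illustrative and depends on where the component lives in your Aspire system:

```groovy
// reference a stage component by its path rather than by a local name
job | "/system/OtherPipelineManager/FetchUrl" | ExtractText
```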
Groovy pipelines allow you to build the list of stages to execute dynamically. This gives you easier, finer-grained control over which stages should and shouldn't run, based on the input job's metadata.
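For instance, a script could assemble the stage list from the document's metadata (a sketch; the mimeType field and the test are illustrative):

```groovy
// build the list of stages dynamically from the document's metadata
def stages = [FetchUrl]
if (doc.mimeType?.text() != "text/plain")
    stages << ExtractText        // only extract text when needed
stages << Feed2Solr

// run the job through the selected stages, in order
stages.each { s -> job | s }
```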
You can use the redirect feature to write the contents of the jobs received by the current Groovy pipeline to a file, using the ">>" operator followed by the target file path.
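For instance, the job can be dumped to out.xml after ExtractText but before Feed2Solr (a sketch using the stage names from this page):

```groovy
// write the job's current content to out.xml, then continue to Feed2Solr
job | FetchUrl | ExtractText >> "out.xml" | Feed2Solr
```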
In the previous example the redirect is executed before the "Feed2Solr" stage, so if that stage adds or modifies any content in the job metadata, the change will not be reflected in the "out.xml" file.
A Closure Stage is a stage embedded in the Groovy Pipeline that receives a Groovy closure to execute. For example:
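A sketch of a closure stage inlined between two named stages:

```groovy
// an inline stage that runs arbitrary Groovy against the current job
job | FetchUrl | stage { println "Fetched: " + job.jobId } | ExtractText
```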
You can use this to configure other job flows too:
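For example (a sketch; the mimeType test and piping from inside a closure stage are illustrative):

```groovy
// a closure stage that conditionally routes the job onward
job | FetchUrl | stage {
    if (doc.mimeType?.text() == "text/html") {
        job | ExtractText    // only HTML documents get text extraction
    }
} | Feed2Solr
```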
Groovy control flow statements can be used to control which pipeline to execute given any condition you want:
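A sketch using an ordinary if/else to pick the stage sequence (the mimeType field is illustrative):

```groovy
// choose a different stage sequence depending on the document type
if (doc.mimeType?.text() == "application/pdf") {
    job | FetchUrl | ExtractText
} else {
    job | FetchUrl
}
job | Feed2Solr    // every job ends up in Solr
```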
You can loop through some stages as needed:
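A sketch of looping a closure stage, tagging the document on each pass (the field name is illustrative, and doc.add is assumed to append a metadata element):

```groovy
// run the same closure stage three times, adding a field each pass
3.times { i ->
    job | stage { doc.add("iteration", String.valueOf(i)) }
}
```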
Each pass through the loop adds its own metadata to the job, so the effects of the looped stages accumulate in the resulting job's document.
Groovy pipelines can also be configured with branches, which determine what happens to a job/document when certain events occur. These branches are configured the same way as in normal pipelines:
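Branches are declared in the pipeline's XML configuration rather than in the script itself. A sketch (pipeline names are illustrative):

```xml
<pipeline name="process" default="true">
  <script>
    <![CDATA[
      job | FetchUrl | ExtractText | Feed2Solr
    ]]>
  </script>
  <branches>
    <branch event="onComplete" pipelineManager="." pipeline="completedPipeline"/>
    <branch event="onError" pipelineManager="." pipeline="errorPipeline"/>
  </branches>
</pipeline>
```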
Stage exceptions are a way, inside Groovy pipelines, to get the same control that branches/error handling provide, but handled independently per Stage. To configure them, call the exceptions() method of the stage to be configured; it receives a Map of labels to a Stage (or a List of Stages). For example:
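A sketch, attaching an onComplete handler to FetchUrl and an onError handler to Feed2Solr:

```groovy
// run two extra stages whenever FetchUrl completes successfully
FetchUrl.exceptions([
    onComplete: [ stage { job >> "fetchUrlCompleted.xml" },
                  stage { println "FetchUrl completed for " + job.jobId } ]
])

// dump the job to a file whenever Feed2Solr throws an error
Feed2Solr.exceptions([
    onError: stage { job >> "Feed2SolrErrors.xml" }
])

job | FetchUrl | ExtractText | Feed2Solr >> "finished.xml"
```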
In this case, when a job successfully completes the FetchUrl stage, the pipeline will execute stage{job >> "fetchUrlCompleted.xml"} | stage{println "FetchUrl completed for "+job.jobId} before continuing with ExtractText. The same applies to the onTerminate and onError exceptions.
For example, if Feed2Solr has an error, it will execute stage{job >> "Feed2SolrErrors.xml"} and the job will then continue to the next Stage, which is a redirect to "finished.xml"; at the end, the "onComplete" branch of the pipeline will be executed. If the "onError" exception were not configured on the Feed2Solr stage, then any error thrown in that stage would be handled by the "onError" branch of the pipeline, and execution would end at that moment, without executing the redirect to "finished.xml".
You can also configure exceptions to lists of Stages:
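A sketch: the map value can be a List of stages, which are executed in order:

```groovy
// run several stages, in order, whenever Feed2Solr throws an error
Feed2Solr.exceptions([
    onError: [ stage { job >> "Feed2SolrErrors.xml" },
               stage { println "Feed2Solr failed for " + job.jobId } ]
])
```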
Nested exception handling is also available:
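A sketch, assuming that an exception stage can itself carry exceptions() like a named stage does:

```groovy
// the error-dump stage has its own onError handler for when the dump fails
def dumpErrors = stage { job >> "errors.xml" }
dumpErrors.exceptions([
    onError: stage { println "could not dump errors for " + job.jobId }
])
FetchUrl.exceptions([ onError: dumpErrors ])

job | FetchUrl | ExtractText
```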
Groovy pipelines provide a way of controlling the flow of subjobs through stages. Using the subJobs() method of each stage, you can specify what to execute for the subjobs generated by that Stage. It receives either a single Groovy Closure, or a Map of labels (used when the subjob was branched) to a Stage (or a List of Stages):
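A sketch; the branch label onSubJob is illustrative, and inside the handlers the job variable is assumed to refer to the subjob:

```groovy
// route subjobs from ExtractText according to their branch label
ExtractText.subJobs([
    onSubJob: [ stage { println "subjob " + job.jobId }, Feed2Solr ]
])
job | FetchUrl | ExtractText
```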
or just a single Closure that will be executed regardless of the subjobs' branch labels:
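A sketch with a single closure (again assuming job refers to the subjob inside the closure):

```groovy
// one closure handles every subjob, whatever its branch label
ExtractText.subJobs { job | Feed2Solr }
job | FetchUrl | ExtractText
```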
Note: Sub Job extractors need a dummy <branches>.
In the current design, when you create a sub job extractor for use in Groovy pipelines, you need to give it a dummy <branches> configuration. For example:
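A sketch of a sub job extractor component carrying a dummy <branches> element (component, factory, event, and pipeline names are illustrative):

```xml
<component name="SubJobExtractor" subType="default" factoryName="aspire-xsub">
  <branches>
    <!-- dummy branch: satisfies the load-time branch-handler check;
         actual routing is done by the Groovy pipeline's subJobs() -->
    <branch event="onSubJob" pipelineManager="." pipeline="doesNotMatter"/>
  </branches>
</component>
```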
Otherwise the component will report a "missing branch handler" error when it is loaded.
A different Thread Pool Manager is assigned to each Stage and parent job to process their subjobs. You can configure the maximum number of thread pools and their sizes:
pipeline/script/@maxThreadPools | 10 | The maximum number of thread pools handled simultaneously by this Groovy pipeline for subjobs. If the maximum number of thread pools in use has been reached, jobs that want to create new subjobs must wait until a thread pool is released by another job. |
pipeline/script/@maxThreadsPerPool | 10 | The maximum number of threads to create (per thread pool) to handle subjobs. |
pipeline/script/@maxQueueSizePerPool | 30 | The size of the queue (per thread pool) for processing subjobs. If the job queue is full, feeders that attempt to put a new job on the queue will block until the queue has room. It is recommended that the queue size be at least as large as the number of threads, if not two or three times larger. |
Example:
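Putting the three attributes together on the pipeline's script element (a sketch; the pipeline name and script body are illustrative):

```xml
<pipeline name="process" default="true">
  <script maxThreadPools="10" maxThreadsPerPool="10" maxQueueSizePerPool="30">
    <![CDATA[
      job | FetchUrl | ExtractText | Feed2Solr
    ]]>
  </script>
</pipeline>
```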
You can create jobs inside a Groovy Pipeline by using the createJob method:
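A sketch, assuming createJob accepts the XML content of the new job's document (the exact signature may differ in your Aspire version):

```groovy
// create a brand-new job with its own document and send it through stages
def newJob = createJob("<doc><url>http://example.com</url></doc>")
newJob | FetchUrl | ExtractText | Feed2Solr
```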
You can use Groovy pipelines to create jobs for each file and directory under a given path. For this purpose, Groovy pipelines provide a function named 'dir'. It takes up to three arguments:
Path | Aspire Home path | Directory from which the files and directories will be fetched. If running AspireShell, this can be changed using the 'cd' (change directory) command. |
Closure | | Closure to execute with every job created for each file and directory. |
Arguments | "" | Specifies whether directories should create jobs (using "+d") and whether files should be extracted recursively (using "+r"). By default, if no arguments are specified, only files create jobs and the crawl is not recursive. |
This function can be used in four different ways, depending on which of the arguments you supply.
Each job created will have an <url> field pointing to the corresponding file/directory.
Example:
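A sketch, assuming the closure is passed last as is idiomatic in Groovy (the path is illustrative; "+d" includes directories and "+r" recurses, per the table above):

```groovy
// create a job for every file and directory under ./data, recursively;
// each job's document carries an <url> field pointing at the file
dir("data", "+d+r") { fileJob ->
    fileJob | FetchUrl | ExtractText
}
```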