The Aspire Scheduler uses the Quartz scheduler from Terracotta (see http://www.quartz-scheduler.org/ for details) to provide the backbone for scheduling jobs. The Aspire scheduler provides a wrapper around the Quartz scheduler and contains all necessary Quartz classes required for operation. Thus (apart from the services and framework), the Aspire Scheduler has no dependencies.
Scheduler | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-scheduler |
subType | default |
Inputs | N/A |
Outputs | An AspireObject published to the configured pipeline manager. |
Upon startup, the Aspire Scheduler reads its configuration from the application.xml file and sets schedules within Quartz to represent each of the configured schedules. Each configuration will be given an id, known as the scheduleId. Optionally, the component will also attempt to read schedule information from a relational database.
The configuration of the schedules will include a “cron” style definition of the job execution time. The format of this definition is described later.
The Quartz scheduler is then started. If both the scheduler and the schedule are enabled, Quartz runs a Java method when the scheduled time is reached. This method checks that the current schedule does not already have a job outstanding and publishes a job onto the configured Aspire pipeline. Each job published by the Aspire scheduler is given a unique job number known as the jobNumber.
Note: The data in the Aspire job is configurable.
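The firing decision described above can be sketched in a few lines. This is an illustration only, not the Aspire implementation: the `Schedule` class, its field names and `should_publish` are hypothetical, and treating the outstanding-job check as applying only to singleton schedules is an assumption based on the configuration options described later on this page.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """Hypothetical stand-in for one configured schedule."""
    enabled: bool = True
    singleton: bool = True
    outstanding_jobs: int = 0  # jobs published but not yet completed


def should_publish(scheduler_enabled: bool, schedule: Schedule) -> bool:
    """Decide whether a job is published when the scheduled time is reached."""
    # Nothing is published unless both the scheduler and the schedule are enabled.
    if not (scheduler_enabled and schedule.enabled):
        return False
    # Assumption: a singleton schedule waits for its previous job to complete.
    if schedule.singleton and schedule.outstanding_jobs > 0:
        return False
    return True
```

For example, `should_publish(True, Schedule(outstanding_jobs=1))` is false under these assumptions, because the default schedule is a singleton with a job still outstanding.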
Cron Style Execution Schedule
The configuration stored in the system.xml file defines the execution time using a “cron” style format. The format used by Quartz differs from some “cron” implementations and is described below.
Cron expressions provide the ability to specify complex time combinations such as "At 8:00am every Monday through Friday" or "At 1:30am every last Friday of the month".
Cron expressions consist of six required fields and one optional field, separated by white space. The fields, in order, are:
Field Name | Allowed Values | Allowed Special Characters |
---|---|---|
Seconds | 0-59 | , - * / |
Minutes | 0-59 | , - * / |
Hours | 0-23 | , - * / |
Day-of-month | 1-31 | , - * ? / L W |
Month | 1-12 or JAN-DEC | , - * / |
Day-of-Week | 1-7 or SUN-SAT | , - * ? / L # |
Year (Optional) | empty, 1970-2199 | , - * / |
The special characters are described below:
Character | Allowed fields | Description |
---|---|---|
* | all | Used to specify all values. For example, "*" in the minute field means "every minute". |
? | day-of-month, day-of-week | Used to specify 'no specific value'. This is useful when you need to specify something in one of the two fields, but not the other. |
- | all | Used to specify ranges. For example "10-12" in the hour field means "the hours 10, 11 and 12". |
, | all | Used to specify additional values. For example "MON,WED,FRI" in the day-of-week field means "the days Monday, Wednesday, and Friday". |
/ | all | Used to specify increments. For example, "0/15" in the seconds field means "the seconds 0, 15, 30, and 45", and "5/15" in the seconds field means "the seconds 5, 20, 35, and 50". Specifying '*' before the '/' is equivalent to specifying 0 as the value to start with. Essentially, for each field in the expression, there is a set of numbers that can be turned on or off. For seconds and minutes, the numbers range from 0 to 59; for hours, 0 to 23; for days of the month, 1 to 31; and for months, 1 to 12. The "/" character simply helps you turn on every "nth" value in the given set. Thus "7/6" in the month field only turns on month "7"; it does NOT mean every 6th month. |
L | day-of-month, day-of-week | Short-hand for "last", but it has a different meaning in each of the two fields. For example, the value "L" in the day-of-month field means "the last day of the month": day 31 for January, day 28 for February in non-leap years. If used in the day-of-week field by itself, it simply means "7" or "SAT". But if used in the day-of-week field after another value, it means "the last xxx day of the month"; for example, "6L" means "the last Friday of the month". You can also specify an offset from the last day of the month, such as "L-3", which means the third-to-last day of the calendar month. When using the 'L' option, it is important not to specify lists or ranges of values, as you'll get confusing or unexpected results. |
W | day-of-month | Used to specify the weekday (Monday-Friday) nearest the given day. As an example, if you were to specify "15W" as the value for the day-of-month field, the meaning is: "the nearest weekday to the 15th of the month". So if the 15th is a Saturday, the trigger will fire on Friday the 14th. If the 15th is a Sunday, the trigger will fire on Monday the 16th. If the 15th is a Tuesday, then it will fire on Tuesday the 15th. However if you specify "1W" as the value for day-of-month, and the 1st is a Saturday, the trigger will fire on Monday the 3rd, as it will not 'jump' over the boundary of a month's days. The 'W' character can only be specified when the day-of-month is a single day, not a range or list of days. The 'L' and 'W' characters can also be combined for the day-of-month expression to yield 'LW', which translates to "last weekday of the month". |
# | day-of-week | Used to specify "the nth" XXX day of the month. For example, the value "6#3" in the day-of-week field means the third Friday of the month (day 6 = Friday and "#3" = the 3rd one in the month). Other examples: "2#1" = the first Monday of the month and "4#5" = the fifth Wednesday of the month. Note that if you specify "#5" and there are not 5 of the given day-of-week in the month, then no firing will occur that month. If the '#' character is used, there can only be one expression in the day-of-week field ("3#1,6#3" is not valid, since there are two expressions). The legal characters and the names of months and days of the week are not case sensitive. |
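The behaviour of the "/" character is easy to misread, so a small sketch may help. The function below is not Quartz code: `expand_increment` and its parameters are hypothetical names, it handles only the `start/step` form, and it assumes that '*' before the '/' starts at the lowest value allowed in the field.

```python
def expand_increment(field: str, low: int, high: int) -> list:
    """Expand a 'start/step' cron field into the values it turns on.

    low and high are the allowed bounds for the field,
    for example 0 and 59 for seconds or 1 and 12 for months.
    """
    start_text, step_text = field.split("/")
    # Assumption: '*' means "start at the lowest allowed value".
    start = low if start_text == "*" else int(start_text)
    step = int(step_text)
    if not low <= start <= high or step <= 0:
        raise ValueError(f"invalid increment field: {field!r}")
    # Turn on every step-th value from start up to the top of the range.
    return list(range(start, high + 1, step))


print(expand_increment("0/15", 0, 59))  # [0, 15, 30, 45]
print(expand_increment("5/15", 0, 59))  # [5, 20, 35, 50]
print(expand_increment("7/6", 1, 12))   # [7]
```

The last call shows the subtlety noted in the table: starting at month 7 with a step of 6 never reaches another value inside 1 to 12, so only month 7 is turned on.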
Notes
- Support for specifying both a day-of-week and a day-of-month value is not complete (you'll need to use the '?' character in one of these fields).
- Overflowing ranges are supported, that is, ranges with a larger number on the left-hand side than on the right. You might use 22-2 to cover 10 o'clock at night until 2 o'clock in the morning, or NOV-FEB. Note that overuse of overflowing ranges creates ranges that don't make sense, and no effort has been made to determine which interpretation CronExpression chooses. An example would be "0 0 14-6 ? * FRI-MON".
Examples
- 0 10 20 * * ? This combination is legal and fires at 8:10pm every day. Here, the stars stand for every date and every month, and the question mark for any day of the week.
- 0 10 20 ? * 1 This combination is legal and fires at 8:10pm every Sunday. Here, the star stands for every month and the question mark for any date.
- 0 10 20 * * 1 This combination IS NOT legal. Quartz does not accept the combination of all dates with a specific day of the week.
- 0 10 20 ? ? SUN This combination IS NOT legal. The month can be a specific value, a range, or all (*), but not any (?).
- * 10 20 * * ? This combination is legal, but DANGEROUS, as it fires 60 times: once for every second of the 10th minute after 8pm, every day.
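The legality rules shown in these examples can be expressed as a short check. This is a sketch under stated assumptions, not the Quartz parser: `check_cron_fields` is a hypothetical helper that looks only at the field count and the placement of '?', and does not validate values or the other special characters.

```python
def check_cron_fields(expression: str) -> str:
    """Return 'ok' or a short reason why the expression is rejected."""
    fields = expression.split()
    # Six required fields plus an optional year.
    if len(fields) not in (6, 7):
        return "expected 6 or 7 fields"
    day_of_month, month, day_of_week = fields[3], fields[4], fields[5]
    if month == "?":
        return "'?' is not allowed in the month field"
    # Exactly one of day-of-month and day-of-week must be '?'.
    if (day_of_month == "?") == (day_of_week == "?"):
        return "use '?' in exactly one of day-of-month and day-of-week"
    return "ok"


for expr in ("0 10 20 * * ?", "0 10 20 ? * 1", "0 10 20 * * 1", "0 10 20 ? ? SUN"):
    print(expr, "->", check_cron_fields(expr))
```

The first two expressions pass and the last two are rejected, matching the list above. A real deployment should rely on Quartz itself to validate expressions.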
Published Jobs
The basic configuration taken from the system.xml file allows the user to optionally specify in XML or JSON the data for the job that will be published to the pipeline. If specified, then the published job will be as configured, but the path to the scheduler, the sourceName, scheduleId and jobNumber will be added as attributes to the root tag (normally <doc>). The source id, action (start/stop/pause/resume), event type (scheduled/manual) and properties are also added.
If the job data is not specified, then an empty document is published onto the configured pipeline:
    <doc scheduler="/path/schedulerName" scheduleId="1" jobNumber="1" sourceName="myJob" sourceId="XXXX" actionProperties="full" actionType="manual" crawlId="123" action="start"/>
Note
- sourceId is only available when the schedule has come from an RDB.
- crawlId is only available when the schedule has come from an RDB and the rdb/sql/getCrawlId SQL is configured.
This method may be used to trigger sub job processing where the contents of the job is irrelevant, but something is needed to start processing at a scheduled time.
Once jobs are published, they run to completion. Jobs that fail are logged to the scheduler log file. Jobs can run indefinitely, as they do not time out.
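To make the attribute handling concrete, the sketch below adds scheduler attributes to the root tag of configured job data, or starts from an empty `<doc/>` when no job data is given. It is an illustration using Python's standard XML library, not Aspire code; `decorate_job` and its parameters are hypothetical, and only four of the attributes described above are shown.

```python
import xml.etree.ElementTree as ET


def decorate_job(job_xml, scheduler_path, schedule_id, job_number, source_name):
    """Add scheduler attributes to the root tag of the job data."""
    # With no configured job data, fall back to an empty <doc/> document.
    root = ET.fromstring(job_xml) if job_xml else ET.Element("doc")
    root.set("scheduler", scheduler_path)
    root.set("scheduleId", str(schedule_id))
    root.set("jobNumber", str(job_number))
    root.set("sourceName", source_name)
    return root


configured = decorate_job("<doc><fetchUrl>example.com</fetchUrl></doc>",
                          "/path/schedulerName", 1, 1, "myJob")
empty = decorate_job(None, "/path/schedulerName", 1, 2, "myJob")
print(ET.tostring(configured, encoding="unicode"))
print(ET.tostring(empty, encoding="unicode"))
```

The configured document keeps its `fetchUrl` child and gains the attributes on its root; the empty job carries only the attributes.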
User interface
The user interface allows the administrator to view and update the schedules via the normal Aspire web interface.
On browsing to the Aspire Scheduler's status page, the administrator can see the current schedules and their status. This includes the schedule, the event, whether the schedule is currently enabled, its last and next execution times, and whether it is currently running (i.e. has submitted a job which has not yet completed). Clicking on a schedule provides further information about it, such as the job data, pipeline and last error response.
From the status page, the administrator is able to enable or disable individual schedules and enable or disable the scheduler.
The administrator is also able to add a new schedule, specifying the schedule, event, and optionally whether the schedule is enabled and is a singleton.
The administrator may also manually "fire" events, causing jobs to be published on to the Aspire pipeline. The administrator may send "start", "stop", "pause" and "resume" jobs. These jobs will specify the action in the action attribute and show "manual" as the actionType attribute.
Configuration
The scheduler recognizes the following configuration tags.
Element | Type | Default | Description |
---|---|---|---|
enabled | boolean | true | Whether the scheduler is enabled. If false, then no jobs will be submitted for any configured schedule. |
schedules | | | One or more schedules on which jobs will be fired. Also see the section on schedules stored in a database below. |
schedules/schedule | | | A schedule on which jobs will be fired. |
schedules/schedule/@name | String | | The (optional) name for the schedule. |
schedules/schedule/@enabled | boolean | true | Whether this specific schedule is enabled. If false, then no jobs will be submitted for this schedule. |
schedules/schedule/@singleton | boolean | true | Specifies that this schedule may only fire one job at a time. If true and the scheduled time is reached again, then a new job will only be published if the previous job has completed. |
schedules/schedule/cron | String | Mandatory for this schedule | Specifies the schedule in cron style (see above for the format). This must be specified for any schedule configured here. |
schedules/schedule/job | String | | Specifies the job data that will be published when the scheduled time is reached. The data can be specified in either XML or JSON style (indicated by the type attribute, see below). The data will have the scheduler information added as attributes to the root node. If not specified, an empty document will be published. NOTE: this configuration item is a String, so XML/JSON text should be wrapped in a <![CDATA[ ]]> section. |
schedules/schedule/job/@type | String | xml | Specifies style of the data in the <job> tag. Can be either xml or json. |
schedules/schedule/event | String | Mandatory for this schedule | Specifies the event to publish the job to. Must match one of the events configured in the branch handler <branches> configuration. |
quartz | N/A | | Container for the properties to be passed to the Quartz Scheduler. |
quartz/property | String | | The value of the property to be passed to the Quartz Scheduler. |
quartz/property/@name | String | | The name of the property to be passed to the Quartz Scheduler. |
The scheduler can read its schedules from a database. To configure this, the following configuration can be used:
Element | Type | Description |
---|---|---|
rdb/@component | String | If schedules should be loaded from a database, this attribute holds the path to the Aspire database connection pool component (aspire-rdb). |
rdb/sql/schedules | String | If schedules should be loaded from a database, this element holds the SQL that will be used to extract the schedules from the database configured via the rdb/@component attribute. See below for the columns that should be returned. |
rdb/sql/jobRunningCheck | String | If schedules taken from the RDB are singletons, this SQL will be run when the schedule fires to check whether a job is still running. If not specified, no check on the database will be performed, but the existing check making sure that the number of outstanding jobs is 0 may still prevent the job from firing. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
rdb/sql/jobStarted | String | This SQL is run when a job is started. Typically it is used to allow singleton control via an external database. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
rdb/sql/jobStopped | String | This SQL is run when a stop job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
rdb/sql/jobPaused | String | This SQL is run when a pause job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
rdb/sql/jobResumed | String | This SQL is run when a resume job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
rdb/sql/jobFinished | String | This SQL is run when a job finishes successfully. Typically it is used to allow singleton control via an external database. This SQL may be blank, to allow completion of a job to be marked by an external process. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
rdb/sql/jobFailed | String | This SQL is run when a job finishes with an error. Typically it is used to allow singleton control via an external database. This SQL may be blank, to allow completion of a job to be marked by an external process. However, if the job failed, the external process may not have marked the job as complete, meaning singleton jobs would be blocked. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
rdb/sql/crawlId | String | The SQL used to determine the crawl id. If this SQL exists, it is run whenever a job is published and the result is added to the job in the crawlId attribute of the document. The first column of the first row of the result set is used as the crawl ID. |
rdb/autoReloadSchedules | long | Time in milliseconds between automatic reloads of the schedules from the RDB. If missing or 0, automatic reloads will be disabled. |
Database Schedule Selection SQL
The SQL should return the mandatory columns and may return the optional columns from the following:
Column | Description |
---|---|
name | The schedule name |
enabled | True if the schedule is enabled (defaults to true). |
singleton | True if this schedule is a singleton (defaults to true). |
cron | The cron schedule (mandatory). |
jobType | The type of data given in the jobData column (defaults to XML). |
jobData | The data to be sent in the job when the scheduled time is reached. This may be given in XML or JSON format as specified by the jobType column and should be given as a string. |
event | The event to publish the job on (mandatory). |
sourceId | The external ID (of the source) to be added to the job (if available). |
The format of the columns follows the formats given in the Configuration section above. Column names can be enforced by use of the SQL “AS” keyword.
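The column defaults listed above can be summarised in a short sketch. This is illustrative only and not how Aspire reads the result set: `schedule_from_row` is a hypothetical helper that takes one row as a dictionary keyed by column name, applies the documented defaults, and rejects rows without the mandatory columns.

```python
def schedule_from_row(row: dict) -> dict:
    """Build a schedule description from one result-set row."""
    # cron and event are the two mandatory columns.
    for column in ("cron", "event"):
        if not row.get(column):
            raise ValueError(f"missing mandatory column: {column}")
    return {
        "name": row.get("name"),
        "enabled": row.get("enabled", True),      # defaults to true
        "singleton": row.get("singleton", True),  # defaults to true
        "cron": row["cron"],
        "jobType": row.get("jobType", "xml"),     # defaults to XML
        "jobData": row.get("jobData"),
        "event": row["event"],
        "sourceId": row.get("sourceId"),
    }


print(schedule_from_row({"cron": "0 10 20 * * ?", "event": "onPublish"}))
```

A row that returns only `cron` and `event` therefore yields an enabled singleton schedule with XML job data type and no job data.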
Database Job Control SQL
SQL contained in the jobRunningCheck, jobStarted, jobStopped, jobPaused, jobResumed, jobFinished and jobFailed elements may contain variables for substitution. Variables are surrounded with { } (see Simple Templates for more details). The following variables may be specified:
Variable | Available | Description |
---|---|---|
scheduler | always | The component name of the scheduler. |
scheduleId | always | The ID of the schedule that fired this job. |
sourceName | always | The name of the source that fired this job. |
sourceId | always | The source ID of the source that fired this job if available (from the sourceId column of the schedule SQL). |
jobNumber | jobStarted, jobStopped, jobPaused, jobResumed, jobFinished, jobFailed | The unique number allocated to this job from the scheduler. |
jobId | jobStarted, jobStopped, jobPaused, jobResumed, jobFinished, jobFailed | The job ID associated to the Job object published for this schedule. |
jobSuccess | jobFinished, jobFailed | true if the job listener received a JobComplete event (i.e. the job completed the pipeline without failure), false otherwise. |
jobResult | jobFinished, jobFailed | XML representation of the result from the JobEvent. |
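As a rough picture of how variable substitution works, the sketch below fills `{name}` placeholders in an SQL template. It is not the Aspire Simple Templates implementation: `fill_template` is a hypothetical function, the `job_status` table and its columns are invented for the example, and leaving unknown variables untouched is an assumption.

```python
import re


def fill_template(sql_template: str, values: dict) -> str:
    """Replace each {name} in the template with its value."""
    def replace(match):
        name = match.group(1)
        # Assumption: placeholders with no known value are left as they are.
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"\{(\w+)\}", replace, sql_template)


# Hypothetical jobStarted SQL; the table and column names are not from Aspire.
template = ("UPDATE job_status SET running = 1 "
            "WHERE schedule_id = {scheduleId} AND job_number = {jobNumber}")
print(fill_template(template, {"scheduleId": 7, "jobNumber": 42}))
```

This is plain text substitution, so it performs no escaping of the substituted values.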
Branch Configuration
The Aspire Scheduler publishes jobs using the branch manager. Thus it requires the standard Branch Handler configuration detailed below:
Element | Type | Description |
---|---|---|
branches/branch/@event | String | The event to configure. At the very least, you should include the onPublish event. |
branches/branch/@pipelineManager | String | The URL of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | String | The name of the pipeline to publish to. |
branches/branch/@stage | String | The name of the stage to publish to. |
Example Configuration
    <component name="myScheduler" subType="default" factoryName="aspire-scheduler">
      <schedules>
        <schedule name="myFirstSchedule" enabled="false">
          <cron>1/10 * * * * ?</cron>
          <event>onPublish</event>
          <job>
            <![CDATA[
              <doc>
                <fetchUrl>support.searchtechnologies.com</fetchUrl>
              </doc>
            ]]>
          </job>
        </schedule>
        <schedule enabled="false">
          <cron>2/10 * * * * ?</cron>
          <event>onPublish2</event>
        </schedule>
        <schedule enabled="false">
          <cron>3/10 * * * * ?</cron>
          <event>onPublish3</event>
          <job type="json">
            <![CDATA[
              { "doc" : { "fetchUrl" : "www.searchtechnologies.com" } }
            ]]>
          </job>
        </schedule>
        <schedule enabled="false">
          <cron>4/10 * * * * ?</cron>
          <event>onPublish4</event>
          <job type="json">
            <![CDATA[
              { "doc" : { "fetchUrl" : "repositories.searchtechnologies.com" } }
            ]]>
          </job>
        </schedule>
      </schedules>
      <branches>
        <branch event="onPublish" pipelineManager="PipelineManager" />
        <branch event="onPublish2" pipelineManager="PipelineManager" pipeline="myPipeline" />
        <branch event="onPublish3" pipelineManager="PipelineManager" pipeline="myPipeline" stage="myStage" />
        <branch event="onPublish4" pipelineManager="PipelineManager-not-exist" />
      </branches>
    </component>
Servlet Commands
The following servlet commands are available from the scheduler (via http://server:port/scheduler?cmd=XXXX&param=value):
Command | Description | Parameters |
---|---|---|
add | Adds a schedule to the scheduler | event: the event the schedule should publish to; cron: the cron schedule |
delete | Deletes a schedule from the scheduler | extId: the external ID of the schedule to be deleted (optional, but this or schedId must be specified); schedId: the ID of the schedule to be deleted (optional, but this or extId must be specified) |
disable | Disables the scheduler, or a schedule if specified | extId: the external ID of the schedule to be disabled (optional); schedId: the ID of the schedule to be disabled (optional) |
enable | Enables the scheduler, or a schedule if specified | extId: the external ID of the schedule to be enabled (optional); schedId: the ID of the schedule to be enabled (optional) |
reload | Reloads all the schedules from the database. | None |
start | Sends a 'start' job for the given schedule | extId: the source (external) ID of the schedule to be started (optional, but this or schedId must be specified); schedId: the ID of the schedule to be started (optional, but this or extId must be specified) |
stop | Sends a 'stop' job for the given schedule | extId: the source (external) ID of the schedule to be stopped (optional, but this or schedId must be specified); schedId: the ID of the schedule to be stopped (optional, but this or extId must be specified) |
pause | Sends a 'pause' job for the given schedule | extId: the source (external) ID of the schedule to be paused (optional, but this or schedId must be specified); schedId: the ID of the schedule to be paused (optional, but this or extId must be specified) |
resume | Sends a 'resume' job for the given schedule | extId: the source (external) ID of the schedule to be resumed (optional, but this or schedId must be specified); schedId: the ID of the schedule to be resumed (optional, but this or extId must be specified) |
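When scripting these commands, the query string needs to be URL-encoded, particularly for cron expressions, which contain spaces and special characters. The sketch below only builds the URL and does not send a request. `scheduler_command_url` is a hypothetical helper, the host and port are placeholders, and the `/scheduler` path assumes the URL pattern shown above.

```python
from urllib.parse import urlencode


def scheduler_command_url(base_url: str, cmd: str, **params) -> str:
    """Build a scheduler servlet URL for the given command and parameters."""
    # urlencode escapes spaces, '*' and '?' so a cron expression survives intact.
    query = urlencode({"cmd": cmd, **params})
    return f"{base_url}/scheduler?{query}"


# Placeholder host and port; substitute your own server.
print(scheduler_command_url("http://localhost:50505", "start", schedId=1))
print(scheduler_command_url("http://localhost:50505", "add",
                            event="onPublish", cron="0 10 20 * * ?"))
```

The first call produces `http://localhost:50505/scheduler?cmd=start&schedId=1`; in the second, the cron expression is encoded as `0+10+20+%2A+%2A+%3F`.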
Services Interface
Other components will be able to access the scheduler via a number of methods. These are made available via two interfaces – one to handle the schedules and one to handle the scheduler.
The component exposes the following interface to handle schedules:
AspireSchedule.java
The component will expose the following interface to handle the scheduler:
AspireScheduler.java