The Aspire Scheduler uses the Quartz scheduler from Terracotta (see http://www.quartz-scheduler.org/ for details) as the backbone for scheduling jobs. The Aspire Scheduler is a wrapper around the Quartz scheduler and contains all the Quartz classes required for operation. Thus, apart from the Aspire services and framework, the Aspire Scheduler has no external dependencies.

 

| Scheduler | |
|---|---|
| Factory Name | com.searchtechnologies.aspire:aspire-scheduler |
| subType | default |
| Inputs | N/A |
| Outputs | An AspireObject published to the configured pipeline manager. |

 

 

Upon startup, the Aspire Scheduler reads its configuration from the application.xml file and sets schedules within Quartz to represent each of the configured schedules. Each configuration will be given an id, known as the scheduleId. Optionally, the component will also attempt to read schedule information from a relational database.

The configuration of the schedules will include a “cron” style definition of the job execution time. The format of this definition is described later.

The Quartz scheduler is then started. If both the scheduler and the schedule are enabled, Quartz runs a Java method when the scheduled time is reached. This method checks that the current schedule does not already have a job outstanding and publishes a job onto the configured Aspire pipeline. Each job published by the Aspire Scheduler is given a unique job number, known as the jobNumber.

Note:  The data in the Aspire job is configurable.

Cron Style Execution Schedule

The configuration stored in the system.xml file defines the execution time using a “cron” style format. The format used by Quartz differs from some “cron” implementations and is described below.

Cron expressions provide the ability to specify complex time combinations such as "At 8:00am every Monday through Friday" or "At 1:30am every last Friday of the month".

Cron expressions consist of six required fields and one optional field, separated by white space. The fields, in order, are:

| Field Name | Allowed Values | Allowed Special Characters |
|---|---|---|
| Seconds | 0-59 | , - * / |
| Minutes | 0-59 | , - * / |
| Hours | 0-23 | , - * / |
| Day-of-month | 1-31 | , - * ? / L W |
| Month | 1-12 or JAN-DEC | , - * / |
| Day-of-Week | 1-7 or SUN-SAT | , - * ? / L # |
| Year (Optional) | empty, 1970-2199 | , - * / |


The special characters are described below:

| Character | Allowed fields | Description |
|---|---|---|
| * | all | Used to specify all values. For example, "*" in the minute field means "every minute". |
| ? | day-of-month, day-of-week | Used to specify 'no specific value'. This is useful when you need to specify something in one of the two fields, but not the other. |
| - | all | Used to specify ranges. For example, "10-12" in the hour field means "the hours 10, 11 and 12". |
| , | all | Used to specify additional values. For example, "MON,WED,FRI" in the day-of-week field means "the days Monday, Wednesday, and Friday". |
| / | all | Used to specify increments. For example, "0/15" in the seconds field means "the seconds 0, 15, 30, and 45", and "5/15" in the seconds field means "the seconds 5, 20, 35, and 50". Specifying '*' before the '/' is equivalent to specifying 0 as the value to start with. Essentially, for each field in the expression, there is a set of numbers that can be turned on or off. For seconds and minutes, the numbers range from 0 to 59; for hours, 0 to 23; for days of the month, 1 to 31; and for months, 1 to 12. The "/" character simply helps you turn on every "nth" value in the given set. Thus "7/6" in the month field only turns on month 7; it does NOT mean every 6th month. |
| L | day-of-month, day-of-week | Short-hand for "last", but it has a different meaning in each of the two fields. For example, the value "L" in the day-of-month field means "the last day of the month": day 31 for January, day 28 for February in non-leap years. If used in the day-of-week field by itself, it simply means "7" or "SAT". But if used in the day-of-week field after another value, it means "the last xxx day of the month"; for example, "6L" means "the last Friday of the month". You can also specify an offset from the last day of the month, such as "L-3", which means the third-to-last day of the calendar month. When using the 'L' option, it is important not to specify lists or ranges of values, as you'll get confusing or unexpected results. |
| W | day-of-month | Used to specify the weekday (Monday-Friday) nearest the given day. For example, "15W" in the day-of-month field means "the nearest weekday to the 15th of the month". So if the 15th is a Saturday, the trigger will fire on Friday the 14th. If the 15th is a Sunday, the trigger will fire on Monday the 16th. If the 15th is a Tuesday, it will fire on Tuesday the 15th. However, if you specify "1W" as the value for day-of-month and the 1st is a Saturday, the trigger will fire on Monday the 3rd, as it will not 'jump' over the boundary of a month's days. The 'W' character can only be specified when the day-of-month is a single day, not a range or list of days. The 'L' and 'W' characters can also be combined in the day-of-month field to yield 'LW', which translates to "last weekday of the month". |
| # | day-of-week | Used to specify "the nth" XXX day of the month. For example, "6#3" in the day-of-week field means the third Friday of the month (day 6 = Friday and "#3" = the 3rd one in the month). Other examples: "2#1" = the first Monday of the month and "4#5" = the fifth Wednesday of the month. Note that if you specify "#5" and there are not five of the given day-of-week in the month, then no firing will occur that month. If the '#' character is used, there can only be one expression in the day-of-week field ("3#1,6#3" is not valid, since there are two expressions). |

The legal characters and the names of months and days of the week are not case sensitive.


Notes

  • Support for specifying both a day-of-week and a day-of-month value is not complete (you'll need to use the '?' character in one of these fields).
  • Overflowing ranges are supported, that is, ranges with a larger number on the left-hand side than on the right. You might use 22-2 to cover 10 o'clock at night until 2 o'clock in the morning, or NOV-FEB. Note that overusing overflowing ranges creates ranges that don't make sense, and no effort has been made to determine which interpretation CronExpression chooses. An example would be "0 0 14-6 ? * FRI-MON".

Examples

  • 0 10 20 * * ? This combination is legal and fires at 8:10pm every day. Here, the stars stand for every day of the month and every month, and the question mark for any day of the week.
  • 0 10 20 ? * 1 This combination is legal and fires at 8:10pm every Sunday. Here, the question mark stands for any day of the month and the star for every month.
  • 0 10 20 * * 1 This combination IS NOT legal. Quartz does not accept "all days of the month" combined with a specific day of the week.
  • 0 10 20 ? ? SUN This combination IS NOT legal. The month can be a specific value, a range, or all (*), but not "any" (?).
  • * 10 20 * * ? This combination is legal, but DANGEROUS, as it fires 60 times every day: once for every second of the minute starting at 8:10pm.
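
As a sketch, the two schedules quoted earlier ("At 8:00am every Monday through Friday" and "At 1:30am every last Friday of the month") could be written as follows, using the schedule elements described in the Configuration section. The schedule names and the onPublish event are illustrative assumptions, not values required by the scheduler.

```xml
<schedules>
  <!-- 8:00:00am, Monday through Friday; '?' leaves day-of-month unspecified -->
  <schedule name="weekdayMorning">
    <cron>0 0 8 ? * MON-FRI</cron>
    <event>onPublish</event>
  </schedule>
  <!-- 1:30:00am on the last Friday of each month (6 = Friday, L = last) -->
  <schedule name="lastFridayOfMonth">
    <cron>0 30 1 ? * 6L</cron>
    <event>onPublish</event>
  </schedule>
</schedules>
```

Both expressions put '?' in the day-of-month field because a day-of-week value is given, in line with the first note above.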

Published Jobs

The basic configuration taken from the system.xml file allows the user to optionally specify in XML or JSON the data for the job that will be published to the pipeline. If specified, then the published job will be as configured, but the path to the scheduler, the sourceName, scheduleId and jobNumber will be added as attributes to the root tag (normally <doc>). The source id, action (start/stop/pause/resume), event type (scheduled/manual) and properties are also added.
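
For illustration only, suppose a schedule is configured with the job data <doc><fetchUrl>support.searchtechnologies.com</fetchUrl></doc>. A job published for that schedule might then look roughly like the sketch below. The attribute names are those described above; every attribute value (including the scheduler path, the sourceName and the "scheduled" actionType) is an assumption rather than captured output.

```xml
<!-- Illustrative only: attribute values are assumptions, not captured output -->
<doc scheduler="/path/myScheduler" scheduleId="1" jobNumber="1"
     sourceName="myFirstSchedule" action="start" actionType="scheduled">
  <!-- The configured job data is kept; the scheduler attributes are added to the root tag -->
  <fetchUrl>support.searchtechnologies.com</fetchUrl>
</doc>
```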

If the job data is not specified, then an empty document is published onto the configured pipeline:

 <doc scheduler="/path/schedulerName" scheduleId="1" jobNumber="1" sourceName="myJob" sourceId="XXXX" actionProperties="full" actionType="manual" crawlId="123" action="start"/>

Note

  • sourceId is only available when the schedule has come from an RDB.
  • crawlId is only available when the schedule has come from an RDB and the rdb/sql/getCrawlId SQL is configured.

This method may be used to trigger sub-job processing where the contents of the job are irrelevant, but something is needed to start processing at a scheduled time.

Once published, jobs run to completion. Jobs that fail are logged to the scheduler log file. Jobs do not time out, so a job could run indefinitely.

User interface

The user interface allows the administrator to view and update the schedules via the normal Aspire web interface.

On browsing to the Aspire Scheduler's status page, the administrator can see the current schedules and their status. This includes the schedule, the event, whether the schedule is currently enabled, its last and next execution times, and whether it is currently running (i.e. has submitted a job which has not yet completed). Clicking on a schedule provides further information about it, such as the job data, pipeline and last error response.

From the status page, the administrator is able to enable or disable individual schedules and enable or disable the scheduler.

The administrator is also able to add a new schedule, specifying the schedule, event, and optionally whether the schedule is enabled and is a singleton.

The administrator may also manually "fire" events, causing jobs to be published on to the Aspire pipeline. The administrator may send "start", "stop", "pause" and "resume" jobs. These jobs will specify the action in the action attribute and show "manual" as the actionType attribute.

Configuration

The scheduler recognizes the following configuration tags.

| Element | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Whether the scheduler is enabled. If false, then no jobs will be submitted for any configured schedule. |
| schedules | | | One or more schedules on which jobs will be fired. Also see the section on schedules stored in a database below. |
| schedules/schedule | | | A schedule on which jobs will be fired. |
| schedules/schedule/@name | String | | The (optional) name for the schedule. |
| schedules/schedule/@enabled | boolean | true | Whether this specific schedule is enabled. If false, then no jobs will be submitted for this schedule. |
| schedules/schedule/@singleton | boolean | true | Specifies that this schedule may only fire one job at a time. If true and the scheduled time is reached again, then a new job will only be published if the previous job has completed. |
| schedules/schedule/cron | String | Mandatory for this schedule | Specifies the schedule in cron style (see above for the format). This must be specified for any schedule configured here. |
| schedules/schedule/job | String | | Specifies the job data that will be published when the scheduled time is reached. The data can be specified in either XML or JSON style (indicated by the type attribute, see below). The data will have the scheduler information added as attributes to the root node. If not specified, an empty document will be published. NOTE: this configuration item is a String, so XML/JSON text should be surrounded with <![CDATA[ ]]>. |
| schedules/schedule/job/@type | String | xml | Specifies the style of the data in the <job> tag. Can be either xml or json. |
| schedules/schedule/event | String | Mandatory for this schedule | Specifies the event to publish the job to. Must match one of the events configured in the branch handler <branches> configuration. |
| quartz | N/A | | Container for the properties to be passed to the Quartz Scheduler. |
| quartz/property | String | | The value of the property to be passed to the Quartz Scheduler. |
| quartz/property/@name | String | | The name of the property to be passed to the Quartz Scheduler. |
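
As a minimal sketch of the quartz container, the fragment below passes two standard Quartz properties through to the underlying scheduler. The property names come from the Quartz configuration reference; the values are arbitrary, and whether a given property has any effect inside the Aspire wrapper is an assumption that should be verified.

```xml
<quartz>
  <!-- @name holds the Quartz property name; the element text holds its value -->
  <property name="org.quartz.scheduler.instanceName">AspireQuartz</property>
  <property name="org.quartz.threadPool.threadCount">5</property>
</quartz>
```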


The scheduler can read its schedules from a database. To configure this, the following configuration can be used:

| Element | Type | Description |
|---|---|---|
| rdb/@component | String | If schedules should be loaded from a database, this attribute holds the path to the Aspire database connection pool component (aspire-rdb). |
| rdb/sql/schedules | String | If schedules should be loaded from a database, this element holds the SQL that will be used to extract the schedules from the database configured via the rdb/@component attribute. See below for the columns that should be returned. |
| rdb/sql/jobRunningCheck | String | If schedules taken from the RDB are singletons, this SQL will be run when the schedule fires to check whether a job is still running. If not specified, no check on the database will be performed, but the existing check making sure that the number of outstanding jobs is 0 may still prevent the job from firing. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
| rdb/sql/jobStarted | String | This SQL is run when a job is started. Typically it is used to allow singleton control via an external database. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
| rdb/sql/jobStopped | String | This SQL is run when a stop job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
| rdb/sql/jobPaused | String | This SQL is run when a pause job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
| rdb/sql/jobResumed | String | This SQL is run when a resume job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
| rdb/sql/jobFinished | String | This SQL is run when a job finishes successfully. Typically it is used to allow singleton control via an external database. This SQL may be blank, to allow completion of a job to be marked by an external process. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
| rdb/sql/jobFailed | String | This SQL is run when a job finishes with an error. Typically it is used to allow singleton control via an external database. This SQL may be blank, to allow completion of a job to be marked by an external process. However, if the job failed, the external process may not have marked the job as complete, meaning singleton jobs would be blocked. The SQL provided is a template that has values substituted. See below for the values that may be substituted. |
| rdb/sql/crawlId | String | The SQL used to determine the crawl id. If this SQL exists, it is run whenever a job is published and the result is added to the job in the crawlId attribute of the document. The first column of the first row of the result set is used as the crawl ID. |
| rdb/autoReloadSchedules | long | Time in milliseconds between automatic reloads of the schedules from the RDB. If missing or 0, automatic reloads will be disabled. |

Database Schedule Selection SQL

The SQL should return the mandatory columns and may return the optional columns from the following:

| Column | Description |
|---|---|
| name | The schedule name. |
| enabled | True if the schedule is enabled (defaults to true). |
| singleton | True if this schedule is a singleton (defaults to true). |
| cron | The cron schedule (mandatory). |
| jobType | The type of data given in the jobData column (defaults to XML). |
| jobData | The data to be sent in the job when the scheduled time is reached. This may be given in XML or JSON format as specified by the jobType column and should be given as a string. |
| event | The event to publish the job on (mandatory). |
| sourceId | The external ID (of the source) to be added to the job (if available). |

The format of each column follows the format of the corresponding item in the Configuration section above. Column names can be enforced by use of the SQL “AS” keyword.
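
A schedule selection query might be configured along the following lines. This is a sketch under stated assumptions: the table aspire_schedules, its column names and the connection pool path /common/connectionPool are hypothetical, and the rdb element is assumed to sit directly under the scheduler component configuration. Only the element names and the AS aliases come from the tables above.

```xml
<rdb component="/common/connectionPool">
  <sql>
    <schedules>
      <![CDATA[
      SELECT sched_name AS name,
             is_enabled AS enabled,
             cron_expr  AS cron,
             event_name AS event,
             source_id  AS sourceId
      FROM   aspire_schedules
      ]]>
    </schedules>
  </sql>
  <!-- Reload the schedules from the database every 60 seconds (value is in milliseconds) -->
  <autoReloadSchedules>60000</autoReloadSchedules>
</rdb>
```

The optional singleton, jobType and jobData columns are left out here, so their defaults would apply.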

Database Job Control SQL

SQL contained in the jobRunningCheck, jobStarted, jobStopped, jobPaused, jobResumed, jobFinished and jobFailed elements may contain variables for substitution. Variables are surrounded with { } (see Simple Templates for more details). The following variables may be specified:

| Variable | Available | Description |
|---|---|---|
| scheduler | always | The component name of the scheduler. |
| scheduleId | always | The ID of the schedule that fired this job. |
| sourceName | always | The name of the source that fired this job. |
| sourceId | always | The source ID of the source that fired this job if available (from the sourceId column of the schedule SQL). |
| jobNumber | jobStarted, jobStopped, jobPaused, jobResumed, jobFinished, jobFailed | The unique number allocated to this job from the scheduler. |
| jobId | jobStarted, jobStopped, jobPaused, jobResumed, jobFinished, jobFailed | The job ID associated to the Job object published for this schedule. |
| jobSuccess | jobFinished, jobFailed | true if the job listener received a JobComplete event (i.e. the job completed the pipeline without failure), false otherwise. |
| jobResult | jobFinished, jobFailed | XML representation of the result from the JobEvent. |
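
The fragment below sketches how the substitution variables might be used for singleton control in an external database. The table aspire_schedules, its columns and the connection pool path are hypothetical, and how substituted values are quoted or escaped is not specified here, so the quoting shown is an assumption to verify against your database.

```xml
<rdb component="/common/connectionPool">
  <sql>
    <!-- Mark the schedule as running when a job starts -->
    <jobStarted>
      UPDATE aspire_schedules SET running = 1, last_job = {jobNumber} WHERE source_id = '{sourceId}'
    </jobStarted>
    <!-- Clear the flag on success -->
    <jobFinished>
      UPDATE aspire_schedules SET running = 0 WHERE source_id = '{sourceId}'
    </jobFinished>
    <!-- Clear the flag on failure too, so a singleton schedule is not left blocked -->
    <jobFailed>
      UPDATE aspire_schedules SET running = 0 WHERE source_id = '{sourceId}'
    </jobFailed>
  </sql>
</rdb>
```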

Branch Configuration

The Aspire Scheduler publishes jobs using the branch manager. Thus it requires the standard Branch Handler configuration detailed below:

| Element | Type | Description |
|---|---|---|
| branches/branch/@event | String | The event to configure. At the very least, you should include the onPublish event. |
| branches/branch/@pipelineManager | String | The URL of the pipeline manager to publish to. Can be relative. |
| branches/branch/@pipeline | String | The name of the pipeline to publish to. |
| branches/branch/@stage | String | The name of the stage to publish to. |

Example Configuration

   <component name="myScheduler" subType="default" factoryName="aspire-scheduler">
     <schedules>
       <schedule name="myFirstSchedule" enabled="false">
         <cron>1/10 * * * * ?</cron>
         <event>onPublish</event>
         <job>
           <![CDATA[
           <doc>
             <fetchUrl>support.searchtechnologies.com</fetchUrl>
           </doc>
           ]]>
         </job>
       </schedule>
       <schedule enabled="false">
         <cron>2/10 * * * * ?</cron>
         <event>onPublish2</event>
       </schedule>
       <schedule enabled="false">
         <cron>3/10 * * * * ?</cron>
         <event>onPublish3</event>
         <job type="json">
           <![CDATA[
           {
             "doc" : {
               "fetchUrl" : "www.searchtechnologies.com"
             }
           }
           ]]>
         </job>
       </schedule>
       <schedule enabled="false">
         <cron>4/10 * * * * ?</cron>
         <event>onPublish4</event>
         <job type="json">
           <![CDATA[
           {
             "doc" : {
               "fetchUrl" : "repositories.searchtechnologies.com"
             }
           }
           ]]>
         </job>
       </schedule>
     </schedules>
     <branches>
       <branch event="onPublish" pipelineManager="PipelineManager" />
       <branch event="onPublish2" pipelineManager="PipelineManager" pipeline="myPipeline" />
       <branch event="onPublish3" pipelineManager="PipelineManager" pipeline="myPipeline" stage="myStage" />
       <branch event="onPublish4" pipelineManager="PipelineManager-not-exist" />
     </branches>
   </component>

Servlet Commands

The following servlet commands are available via the scheduler (via http://server:port/scheduler?cmd=XXXX&param=value):

| Command | Description | Parameters |
|---|---|---|
| add | Adds a schedule to the scheduler | event: the event the schedule should publish to; cron: the cron schedule; name: the name for the schedule (optional); enabled: true if the schedule is enabled (optional - defaults to true); singleton: true if only one job should run at a time (optional - defaults to false); job: the data to be sent when the schedule fires (optional); jobType: the format of the job parameter - xml/json (optional - defaults to xml) |
| delete | Deletes a schedule from the scheduler | extId: the external ID of the schedule to be deleted (optional, but this or schedId must be specified); schedId: the ID of the schedule to be deleted (optional, but this or extId must be specified) |
| disable | Disables the scheduler, or a schedule if specified | extId: the external ID of the schedule to be disabled (optional); schedId: the ID of the schedule to be disabled (optional). If no schedule is specified, the scheduler will be disabled. |
| enable | Enables the scheduler, or a schedule if specified | extId: the external ID of the schedule to be enabled (optional); schedId: the ID of the schedule to be enabled (optional). If no schedule is specified, the scheduler will be enabled. |
| reload | Reloads all the schedules from the database. | None |
| start | Sends a 'start' job for the given schedule | extId: the source (external) ID of the schedule to be started (optional, but this or schedId must be specified); schedId: the ID of the schedule to be started (optional, but this or extId must be specified); properties: string containing properties to be sent in the actionProperties attribute of the job (see below) |
| stop | Sends a 'stop' job for the given schedule | extId: the source (external) ID of the schedule to be stopped (optional, but this or schedId must be specified); schedId: the ID of the schedule to be stopped (optional, but this or extId must be specified) |
| pause | Sends a 'pause' job for the given schedule | extId: the source (external) ID of the schedule to be paused (optional, but this or schedId must be specified); schedId: the ID of the schedule to be paused (optional, but this or extId must be specified) |
| resume | Sends a 'resume' job for the given schedule | extId: the source (external) ID of the schedule to be resumed (optional, but this or schedId must be specified); schedId: the ID of the schedule to be resumed (optional, but this or extId must be specified) |

Services Interface

Other components can access the scheduler via a number of methods. These are made available via two interfaces: one to handle the schedules and one to handle the scheduler.

The component exposes the following interface to handle schedules:

AspireSchedule.java

The component exposes the following interface to handle the scheduler:

AspireScheduler.java

 
