Full crawls

Full crawls use the SQL from the Full crawl sql configuration and execute it against the configured database. Each selected row is formed into a document using the column names as document elements, and this document is sent to the Aspire workflow as an add. The document is created with an id whose value is taken from the column identified in the Id column configuration. All columns selected will appear in the document sent to Aspire, although their position in the document will depend on whether they are recognised as standard Aspire fields.
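
For illustration only: if the Id column is configured as id, a selected row with columns id, first_name and last_name might produce a document along these lines (the element names and values here are hypothetical, and the exact layout depends on which columns are recognised as standard Aspire fields):

<doc>
  <id>1001</id>
  <first_name>Jane</first_name>
  <last_name>Smith</last_name>
</doc>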

Example SQL for a full crawl

Any valid SQL that performs a select will allow you to perform a full crawl.

Simple

select * from main_table

Complex

select mt.id as ID, mt.first_name, mt.last_name, mt.first_name + ' ' + mt.last_name AS full_name, mt.col3, jt.col1, jt.col2, jt.col3 from main_table mt, joined_table jt where mt.id = jt.id

Using slices

To increase performance of full crawls, you can opt to "slice" data. When selected, the full crawl sql will be modified with a where clause based on the chosen number of slices and the row id. This means that rather than trying to select all rows from the database in a single select, the connector will perform multiple smaller selects in parallel.

Example

If you had full crawl sql of

select * from main_table

and chose to use 10 slices with an id column of id, you should modify the full crawl sql by adding {slices} to the condition (or by adding the condition where there is none):

select * from main_table where {slices}

The sql executed at the server would then be

select * from main_table where id mod 10 = 0
select * from main_table where id mod 10 = 1
select * from main_table where id mod 10 = 2
select * from main_table where id mod 10 = 3
select * from main_table where id mod 10 = 4
select * from main_table where id mod 10 = 5
select * from main_table where id mod 10 = 6
select * from main_table where id mod 10 = 7
select * from main_table where id mod 10 = 8
select * from main_table where id mod 10 = 9

Each of these would return (on average) one tenth of the data.

Incremental feed

Incremental crawling involves a three stage process. First, a "pre-update" stage occurs. This happens as soon as the incremental crawl starts and is executed once (for the crawl). This stage runs SQL against the database and can be used to (say) mark rows that should be processed in this crawl. Next, the main update stage occurs. This selects data from the main data table(s) in a similar way to the full crawl, but should also select information about the update, including its type (insert/update/delete) and an identifier for the update (typically its sequence number). Again, this main stage occurs once per crawl. Finally, a "post-update" stage occurs. This is different in that it is executed for each row processed. Its purpose (typically) is to mark the updates as processed, so they are not considered again next time.

In order to handle updates, you should configure a queue table in your database. The actual name of the table can be anything, as long as it is referenced consistently in the SQL statements used in the job or configuration file. When a row changes in the main table, you should configure a database trigger to fire and insert a row into this queue table. The queue table must have (at least) four columns which contain values for sequence, id, action and status. Again, these columns can be named anything you want; the first three are set using job or configuration parameters, and the status is used only in the SQL statements below. A sketch of a suitable table and trigger is shown after the list below.

  • sequence is a unique incrementing ID for this change.
  • id indicates the ID (primary key) of the row in the main data table that changed.
  • action is a single character indicating the type of change (I for insert, U for update, and D for delete, or N for no change)
  • status is a flag indicating the current status of the update (suggested values are W for waiting to be processed, I for in progress and C for completed).
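
As a minimal sketch (not a definitive implementation), such a queue table and trigger might look like the following. The names updates_table, seq_id, id, action, status and main_data are assumptions chosen to match the examples later on this page, and the identity column and trigger syntax will need adapting to your particular database:

-- Queue table sketch: table and column names are assumptions matching the examples below
CREATE TABLE updates_table (
  seq_id BIGINT GENERATED ALWAYS AS IDENTITY,  -- sequence: unique, incrementing id for the change
  id     BIGINT NOT NULL,                      -- id (primary key) of the changed row in main_data
  action CHAR(1) NOT NULL,                     -- I (insert), U (update), D (delete) or N (no change)
  status CHAR(1)                               -- W (waiting), I (in progress) or C (completed)
);

-- Trigger sketch (SQL-standard style; the real syntax varies by vendor) queuing each update to main_data
CREATE TRIGGER main_data_updated
AFTER UPDATE ON main_data
REFERENCING NEW ROW AS changed
FOR EACH ROW
  INSERT INTO updates_table (id, action, status)
  VALUES (changed.id, 'U', 'W');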

Your pre-update SQL can then update the status (from W to I) for those updates you wish to process on this pass. This could be all the updates that are not complete, or a subset (the first n). Typically you should process the updates in the order they are received - i.e. increasing sequence.

Your data extraction SQL should then join this update table with the main data table, selecting the columns you wish to index (those with a status of I) and the id, sequence and action at a minimum, ordering the rows by increasing sequence number. The column names returned for id, sequence and action must match the column names specified in the job parameters, configuration parameters or defaults - you may need to use the SQL AS construct.

A document will then be formed for each row returned and submitted to the Aspire workflow. The connector will ensure that jobs created for a particular document id are finished before another for the same id is started, but jobs for differing ids could be submitted out of sequence order.

Post Update SQL

When a document is completed, the "post-update" SQL from the configuration will be executed. This SQL statement will undergo substitution, allowing values from the processed rows to be used. The possible substitutions are shown below:

Placeholder     Description
{documentId}    The document id from the column identified by the id column.
{action}        The action from the column identified by the action column.
{sequenceId}    The sequence from the column identified by the sequence column.
{failed}        true if the submitted job failed, false otherwise.

The "post-update" SQL should update the status in the queue table (from I to C) to indicate the update has been processed.

Alternative Process

If you wish to keep the number of rows in the queue table to a minimum, you may use the "post-update" SQL to delete the row instead. In that case, the "pre-update" SQL may not be needed and can be omitted from the configuration.
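
As a sketch, using the same queue table names assumed in the example below, the "post-update" SQL for this approach could simply be:

DELETE FROM
  updates_table
WHERE
  seq_id = {sequenceId}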

Example

Consider the following configuration

Pre incremental sql

UPDATE 
  updates_table 
SET 
  status='I' 
WHERE 
  seq_id IN 
  ( 
    SELECT 
      seq_id 
    FROM 
      updates_table 
    WHERE 
      status IS null OR 
      ( 
        status!='I' AND 
        status != 'C' 
      ) 
    ORDER BY 
      seq_id ASC 
    FETCH FIRST 4 ROWS ONLY 
  )

When run, the SQL selects from the updates table the 4 rows with the lowest sequence numbers whose status is not I (in progress) or C (completed), and sets their status to I (in progress).

Incremental sql

SELECT 
  updates_table.seq_id AS seq_id, 
  updates_table.id     AS doc_id, 
  updates_table.action AS update_action, 
  main_data.col1       AS cola, 
  main_data.col2       AS colb, 
  main_data.col3       AS coly, 
  main_data.col4       AS colz 
FROM 
  main_data 
RIGHT OUTER JOIN 
  updates_table 
ON 
  main_data.id = updates_table.id 
WHERE 
  updates_table.status = 'I' 
ORDER BY 
  updates_table.seq_id ASC 

When run, data is selected from the main table where a corresponding row in the updates table is marked as I (in progress). The join between the tables is on "id".

Post incremental sql

UPDATE 
  updates_table
SET
  status = 'C'
WHERE
  seq_id = {sequenceId}

This SQL is run when a document is successfully processed by the connector. It uses the sequence id from the document and updates the status of the row with that sequence number so that it is not processed again. The sequence id in the document was taken from the column of the incremental SQL identified by the "Sequence column" configuration, and is substituted using the {sequenceId} variable.

Bounded incremental crawls

Bounded incremental crawls allow you to select updates based on some limits rather than selecting from an updates table. You can choose to bound incremental crawls using time or some other SQL.

When using bounding, two more variables are available in incremental mode:

Placeholder     Description
{lowerBound}    The lower bound for the crawl
{upperBound}    The upper bound for the crawl

When entering SQL you may use the variables {lowerBound} and {upperBound} in a WHERE clause to limit the data collected. The {upperBound} will be calculated at the start of the crawl. The {lowerBound} will be the {upperBound} from the previous crawl. Two types of bounding are available:

  • Timestamp
    • returns the bounds as a 'long' value representing the current system time
  • SQL
    • allows you to define SQL to return the new upper bound when the crawl starts

If you choose SQL, you will enter SQL that is then run at the beginning of a crawl to return the new upper bound. The upper bound will be taken from the first column of the first row returned.

Time bound example

To select all rows in a table where an update has been made since the last crawl, checking the modified_date column (containing timestamps as a long value), your incremental sql could look like this

select * from main_table where modified_date > {lowerBound} and modified_date <= {upperBound}

SQL bound example

To select all rows in a table where an update has been made since the last crawl, checking a transaction id in the trans_id column, you could configure the Upper bound sql as

select max(trans_id) from main_table

The incremental sql could look like this

select * from main_table where trans_id > {lowerBound} and trans_id <= {upperBound}