...
Each of these would return (on average) one tenth of the data.
Incremental crawling involves a three-stage process. First, a "pre-update" stage occurs. This happens as soon as the incremental crawl starts and is executed once (for the crawl). This stage runs SQL against the database and can be used to (say) mark rows that should be processed in this crawl. Next, the main update stage occurs. This selects data from the main data table(s) in a similar way to the full crawl, but should also select information about the update, including its type (insert/update/delete) and an identifier for the update (typically its sequence number). Again, this main stage occurs once per crawl. Finally, a "post-update" stage occurs. This is different in that it is executed once for each row processed. Its purpose (typically) is to mark the updates as processed, so they are not considered again next time.
In order to handle updates, you should configure a queue table in your database. The actual name of the table can be anything, as long as it is referenced consistently in the SQL statements used in the job or configuration file. When a row changes in a main data table, a database trigger should fire and insert a row into this queue table. The queue table must have (at least) four columns holding values for sequence, id, action and status. Again, these columns can be named anything you want; the first three are identified using job or configuration parameters, and the status is used only in the SQL statements below.
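As an illustration, the queue table and trigger might look like the following. This is only a sketch: the syntax is Oracle-style, and the table, column and trigger names (`updates_table`, `seq_id`, `main_data_queue_trg`, etc.) are examples that must match whatever names you actually use in your job SQL.

```sql
-- Hypothetical queue table (Oracle-style syntax, names are examples only)
CREATE TABLE updates_table (
  seq_id NUMBER GENERATED ALWAYS AS IDENTITY,  -- sequence column
  id     VARCHAR2(64) NOT NULL,                -- id of the changed document
  action CHAR(1)      NOT NULL,                -- I (insert), U (update) or D (delete)
  status CHAR(1)                               -- null/W = waiting, I = in process, C = complete
);

-- Hypothetical trigger on the main data table that queues each change
CREATE OR REPLACE TRIGGER main_data_queue_trg
AFTER INSERT OR UPDATE OR DELETE ON main_data
FOR EACH ROW
BEGIN
  IF INSERTING THEN
    INSERT INTO updates_table (id, action) VALUES (:NEW.id, 'I');
  ELSIF UPDATING THEN
    INSERT INTO updates_table (id, action) VALUES (:NEW.id, 'U');
  ELSE
    INSERT INTO updates_table (id, action) VALUES (:OLD.id, 'D');
  END IF;
END;
```

Note that for a delete, the trigger must use the old row's id, since the row no longer exists in the main table.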
Your pre-update SQL can then update the status (from W to I) for those updates you wish to process on this pass. This could be all the updates that are not complete, or a subset (the first n). Typically you should process the updates in the order they are received - i.e. increasing sequence.
Your data extraction SQL should then join this update table with the main data table, selecting the columns you wish to index (those with a status of I) and the id, sequence and action at a minimum, ordering the rows by increasing sequence number. The column names returned for id, sequence and action must match the column names specified in the job parameters, configuration parameters or defaults - you may need to use the SQL AS construct.
A document will then be formed for each row returned and submitted to the Aspire workflow. The connector will ensure that jobs created for a particular document id are finished before another job for the same id is started, but jobs for differing ids may be submitted out of sequence order.
When a document is completed, the "post-update" SQL from the configuration will be executed. This SQL statement undergoes substitution, allowing values from the processed row to be used. The possible substitutions are shown below:
| Placeholder | Description |
|---|---|
| {documentId} | The document id, from the column identified as the id column. |
| {action} | The action, from the column identified as the action column. |
| {sequenceId} | The sequence, from the column identified as the sequence column. |
| {failed} | true if the submitted job failed, false otherwise. |
The "post-update" SQL should update the status in the queue table (from I to C) to indicate the update has been processed.
If you wish to keep the number of rows in the queue table to a minimum, you may instead use the "post-update" SQL to delete the row. In that case you may not need the "pre-update" SQL, and it can be omitted from the configuration.
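For example, a delete-based post-update statement might look like the following. This is a sketch only, assuming the queue table is named `updates_table` with a sequence column `seq_id`; adjust the names to match your own schema.

```sql
-- Remove the processed update from the queue instead of marking it complete
DELETE FROM
  updates_table
WHERE
  seq_id = {sequenceId}
```

With this approach, any row remaining in the queue table is by definition unprocessed, so the pre-update status marking may be unnecessary.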
Consider the following configuration:
Pre-incremental SQL
```sql
UPDATE
  updates_table
SET
  status = 'I'
WHERE
  seq_id IN
  (
    SELECT
      seq_id
    FROM
      updates_table
    WHERE
      status IS NULL OR
      (
        status != 'I' AND
        status != 'C'
      )
    ORDER BY
      seq_id ASC
    FETCH FIRST 4 ROWS ONLY
  )
```
When run, the SQL selects from the updates table the four rows with the lowest sequence numbers whose status is not I (in process) or C (complete), and sets their status to I (in process).
Incremental SQL
```sql
SELECT
  updates_table.seq_id AS seq_id,
  updates_table.id AS doc_id,
  updates_table.action AS update_action,
  main_data.col1 AS cola,
  main_data.col2 AS colb,
  main_data.col3 AS coly,
  main_data.col4 AS colz
FROM
  main_data
RIGHT OUTER JOIN
  updates_table
ON
  main_data.id = updates_table.id
WHERE
  updates_table.status = 'I'
ORDER BY
  updates_table.seq_id ASC
```
When run, data is selected from the main table where a corresponding row in the updates table is marked as I (in process). The join between the tables is on "id". A right outer join is used so that updates for deleted documents, which no longer have a matching row in the main table, are still returned.
Post-incremental SQL
```sql
UPDATE
  updates_table
SET
  status = 'C'
WHERE
  seq_id = {sequenceId}
```
This SQL is run when a document is successfully processed by the connector. It uses the sequence id from the document and updates the status of the row with that sequence number, so that it is not processed again. The sequence id in the document was taken from the column in the update SQL identified by the "Sequence column" configuration, and is referenced here using the {sequenceId} variable.
Bounded incremental crawls