...
Each of these would return (on average) one tenth of the data.
Incremental crawling involves a three-stage process. First, a "pre-update" stage occurs. This happens as soon as the incremental crawl starts and is executed once (for the crawl). This stage runs SQL against the database and can be used to (say) mark rows that should be processed in this crawl. Next, the main update stage occurs. This selects data from the main data table(s) in a similar way to the full crawl, but should also select information about the update, including its type (insert/update/delete) and an identifier for the update (typically its sequence number). Again, this main stage occurs once per crawl. Finally, a "post-update" stage occurs. This is different in that it is executed once for each row processed. Its purpose (typically) is to mark the updates as processed, so they are not considered again next time.
In order to handle updates, you should configure a queue table in your database. The actual name of the table can be anything, as long as it is referenced consistently in the SQL statements used in the job or configuration file. When a row changes in a main data table, a database trigger should fire and insert a row into this queue table. The queue table must have (at least) four columns holding values for sequence, id, action and status. Again, these columns can be named anything you want; the first three are identified using job or configuration parameters, and the status is used only in the SQL statements below.
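As an illustration, the queue table and trigger might look like the following. This is only a sketch: the syntax is Oracle-style, and the table, column and trigger names (`updates_table`, `seq_id`, `main_data_queue_trg`, etc.) are examples that must match whatever names you actually use in your job SQL.

```sql
-- Hypothetical queue table (Oracle-style syntax, names are examples only)
CREATE TABLE updates_table (
  seq_id NUMBER GENERATED ALWAYS AS IDENTITY,  -- sequence column
  id     VARCHAR2(64) NOT NULL,                -- id of the changed document
  action CHAR(1)      NOT NULL,                -- I (insert), U (update) or D (delete)
  status CHAR(1)                               -- null/W = waiting, I = in process, C = complete
);

-- Hypothetical trigger on the main data table that queues each change
CREATE OR REPLACE TRIGGER main_data_queue_trg
AFTER INSERT OR UPDATE OR DELETE ON main_data
FOR EACH ROW
BEGIN
  IF INSERTING THEN
    INSERT INTO updates_table (id, action) VALUES (:NEW.id, 'I');
  ELSIF UPDATING THEN
    INSERT INTO updates_table (id, action) VALUES (:NEW.id, 'U');
  ELSE
    INSERT INTO updates_table (id, action) VALUES (:OLD.id, 'D');
  END IF;
END;
```

Note that for a delete, the trigger must use the old row's id, since the row no longer exists in the main table.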
Your pre-update SQL can then update the status (from W to I) for those updates you wish to process on this pass. This could be all the updates that are not complete, or a subset (the first n). Typically you should process the updates in the order they are received - i.e. increasing sequence.
Your data extraction SQL should then join this update table with the main data table, selecting the columns you wish to index (those with a status of I) and the id, sequence and action at a minimum, ordering the rows by increasing sequence number. The column names returned for id, sequence and action must match the column names specified in the job parameters, configuration parameters or defaults - you may need to use the SQL AS construct.
A document will then be formed for each row returned and submitted to the Aspire workflow. The connector will ensure that jobs created for a particular document id are finished before another job for the same id is started, but jobs for differing ids may be submitted out of sequence order.
When a document is completed, the "post-update" SQL from the configuration will be executed. This SQL statement undergoes substitution, allowing values from the processed row to be used. The possible substitutions are shown below:
| Placeholder | Description |
|---|---|
| {documentId} | The document id, from the column identified as the id column. |
| {action} | The action, from the column identified as the action column. |
| {sequenceId} | The sequence, from the column identified as the sequence column. |
| {failed} | true if the submitted job failed, false otherwise. |
The "post-update" SQL should update the status in the queue table (from I to C) to indicate the update has been processed.
If you wish to keep the number of rows in the queue table to a minimum, you may instead use the "post-update" SQL to delete the row. In that case you may not need the "pre-update" SQL, and it can be omitted from the configuration.
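For example, a delete-based post-update statement might look like the following. This is a sketch only, assuming the queue table is named `updates_table` with a sequence column `seq_id`; adjust the names to match your own schema.

```sql
-- Remove the processed update from the queue instead of marking it complete
DELETE FROM
  updates_table
WHERE
  seq_id = {sequenceId}
```

With this approach, any row remaining in the queue table is by definition unprocessed, so the pre-update status marking may be unnecessary.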
Consider the following configuration:
Pre-incremental SQL
```sql
UPDATE
  updates_table
SET
  status = 'I'
WHERE
  seq_id IN
  (
    SELECT
      seq_id
    FROM
      updates_table
    WHERE
      status IS NULL OR
      (
        status != 'I' AND
        status != 'C'
      )
    ORDER BY
      seq_id ASC
    FETCH FIRST 4 ROWS ONLY
  )
```
When run, the SQL selects from the updates table the four rows with the lowest sequence numbers whose status is not I (in process) or C (complete), and sets their status to I (in process).
Incremental SQL
```sql
SELECT
  updates_table.seq_id AS seq_id,
  updates_table.id AS doc_id,
  updates_table.action AS update_action,
  main_data.col1 AS cola,
  main_data.col2 AS colb,
  main_data.col3 AS coly,
  main_data.col4 AS colz
FROM
  main_data
RIGHT OUTER JOIN
  updates_table
ON
  main_data.id = updates_table.id
WHERE
  updates_table.status = 'I'
ORDER BY
  updates_table.seq_id ASC
```
When run, data is selected from the main table where a corresponding row in the updates table is marked as I (in process). The join between the tables is on "id". A right outer join is used so that updates for deleted documents, which no longer have a matching row in the main table, are still returned.
Post-incremental SQL
```sql
UPDATE
  updates_table
SET
  status = 'C'
WHERE
  seq_id = {sequenceId}
```
This SQL is run when a document is successfully processed by the connector. It uses the sequence id from the document and updates the status of the row with that sequence number, so that it is not processed again. The sequence id in the document was taken from the column in the update SQL identified by the "Sequence column" configuration, and is referenced here using the {sequenceId} variable.
Bounded incremental crawls