Full crawls

Full crawls use the SQL from the Full crawl sql configuration and execute it against the configured database. Each selected row is formed into a document using the column names as document elements, and this document is sent to the Aspire workflow as an add. The document is created with an id whose value is taken from the column identified in the Id column configuration. All columns selected will appear in the document sent to Aspire, although their position in the document will depend on whether they are recognised as standard Aspire Fields.

Example SQL for a full crawl

Any valid SQL that performs a select will allow you to perform a full crawl.

Simple

Code Block
languagesql
themeMidnight
select * from main_table

Complex

Code Block
languagesql
themeMidnight
select mt.id as ID, mt.first_name, mt.last_name, mt.first_name || ' ' || mt.last_name AS full_name, mt.col3, jt.col1, jt.col2, jt.col3 from main_table mt, joined_table jt where mt.id = jt.id

Using slices

To increase the performance of full crawls, you can opt to "slice" the data. When selected, the full crawl SQL will be modified with a where clause based on the chosen number of slices and the row id. This means that rather than trying to select all rows from the database in a single select, the connector will perform multiple smaller selects in parallel.

Example

If you had full crawl SQL of

...

Each of these would return (on average) one tenth of the data.
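As an illustrative sketch only (the exact where clause the connector generates may differ), slicing select * from main_table into 10 slices on a numeric id column might produce statements like:

Code Block
languagesql
themeMidnight
-- slice 0 of 10
select * from main_table where id % 10 = 0
-- slice 1 of 10
select * from main_table where id % 10 = 1
-- ...and so on, up to slice 9 of 10
select * from main_table where id % 10 = 9

Each statement selects a disjoint subset of rows, so the ten selects can run in parallel without overlap.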

Incremental feed

Incremental crawling involves a three-stage process. First, a "pre-update" stage occurs. This happens as soon as the incremental crawl starts and is executed once (for the crawl). This stage runs SQL against the database and can be used to (say) mark rows that should be processed in this crawl. Next, the main update stage occurs. This selects data from the main data table(s) in a similar way to the full crawl, but should also select information about the update, including its type (insert/update/delete) and an identifier for the update (typically its sequence number). Again, this main stage occurs once per crawl. Finally, a "post-update" stage occurs. This differs in that it is executed for each row processed. Its purpose (typically) is to mark the updates as processed, so they are not considered again in the next crawl.
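As a sketch of the first two stages, assuming a hypothetical queue table named updates_queue with seq, action, id and status columns (status N = new, I = in progress), populated by triggers on the main table - your actual table and column names will differ:

Code Block
languagesql
themeMidnight
-- Pre-update SQL: mark the new queue rows for this crawl (runs once per crawl)
update updates_queue set status = 'I' where status = 'N'

-- Main update SQL: select the marked updates, joining back to the main data
-- (left join so that deletes, where the main row is gone, are still returned)
select q.seq, q.action, mt.* from updates_queue q left join main_table mt on q.id = mt.id where q.status = 'I'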

...

A document will then be formed for each row returned and submitted to the Aspire workflow. The connector will ensure that jobs created for a particular document id are finished before another job for the same id is started, but jobs for differing ids may be submitted out of sequence order.

Post Update SQL

When a document is completed, the "post-update" SQL from the configuration will be executed. This SQL statement will undergo substitution, allowing values from the processed rows to be used. The possible substitutions are shown below:

...

The "post-update" SQL should update the status in the queue table (from I to C) to indicate the update has been processed.
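For example, assuming the same hypothetical updates_queue table and assuming the row's sequence number is available via a {seq} substitution (check the substitution list above for the actual names), the "post-update" SQL might look like:

Code Block
languagesql
themeMidnight
-- Mark this row's update as completed so the next crawl skips it
update updates_queue set status = 'C' where seq = {seq}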

Alternative Process

If you wish to keep the number of rows in the queue table to a minimum, you may use the "post-update" SQL to delete the row. You may then not need the "pre-update" SQL, in which case it can be omitted from the configuration.
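A sketch of this alternative, again using a hypothetical updates_queue table and {seq} substitution:

Code Block
languagesql
themeMidnight
-- Remove the processed row entirely rather than marking it complete
delete from updates_queue where seq = {seq}

Because processed rows are removed rather than flagged, there is nothing for a "pre-update" stage to mark, which is why it can be omitted.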

Example

Consider the following configuration

...

If you choose SQL, you will enter SQL that is then run at the beginning of a crawl to return the new upper bound. The upper bound will be taken from the first column of the first row returned.

Time bound example

To select all rows in a table where an update has been made since the last crawl, checking the modified_date column (containing timestamps as a long value), your incremental SQL could look like this:

Code Block
languagesql
themeMidnight
select * from main_table where modified_date > {lowerBound} and modified_date <= {upperBound}

SQL bound example

To select all rows in a table where an update has been made since the last crawl, checking a transaction id in the trans_id column, you could configure the Upper bound sql as

...
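As a sketch, assuming trans_id is a monotonically increasing transaction id in main_table (the exact query depends on your schema), the Upper bound sql could return the largest id seen so far, and the incremental SQL could then select the rows between the bounds:

Code Block
languagesql
themeMidnight
-- Upper bound sql: the new upper bound is taken from the first column of the first row
select max(trans_id) from main_table

-- Incremental sql: select rows whose transaction id falls between the stored bounds
select * from main_table where trans_id > {lowerBound} and trans_id <= {upperBound}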