Frequently Asked Questions (FAQ)

How can I perform incremental crawling

There are a number of ways in which you can perform incremental crawling, but here's one method.

You should configure an updates queue table in your database. The actual name of the table can be anything, as long as it is referenced consistently in the SQL statements used in the job or configuration file.

Then code a trigger that fires when an update is made to the table(s) that contains your data. This trigger should, when fired, insert a row into this queue table. The queue table must have (at least) four columns which contain values for sequence, id, action and status. Again, these columns can be named anything you want; the first three are set using job or configuration parameters, and the status is used only in the following SQL statements. The columns are described below:

  • sequence is a unique incrementing ID for this change.
  • id indicates the ID (primary key) of the row in the main data table that changed.
  • action is a single character indicating the type of change (I for insert, U for update, and D for delete, or N for no change).
  • status is a flag indicating the current status of the update (suggested values are W for waiting to be processed, I for in progress and C for completed).

When an incremental crawl is run, the pre-processing SQL is run first. You can use this to update the status (from W to I) for those updates you wish to process on this pass. This could be all the updates that are not complete, or a subset (the first n). Typically you should process the updates in the order they are received, i.e., increasing sequence.

So after the pre-processing, your queue table will contain a number of rows marked as in progress and containing the id of the row in the data table.

The data extraction stage is the run. Your data extraction SQL should then join this queue table with the main data table, selecting the columns you wish to index (those with a status of I) and the id, sequence and action at a minimum, ordering the rows by increasing sequence number. The column names returned for id, sequence and action must match the column names specified in the job parameters, configuration parameters or defaults - you may need to use the SQL AS construct.

Each row returned by the SQL will be submitted to Aspire with an appropriate action. You can use Aspire to process these actions differently.

Once a row is submitted, it enters the post-processing stage. Here SQL is executed for each row that was submitted to Aspire. Different SQL can be executed for rows that have been successfully processed versus rows that failed processing. The "post-update" SQL should update the status in the queue table (from I to C) to indicate the update has been processed.

Alternative Process

If you wish to keep the number of rows in the queue table to a minimum, you may use the "post-update" SQL to delete the row. You may then not need to use the "pre-update" SQL. In this case, this could be omitted from the configuration.

Common Errors

Connection errors

If you can't get a connection to the database, you'll see an exception similar to this:

Couldn't get an RDB connection

com.searchtechnologies.aspire.services.AspireException: Unable to open connection to jdbc:postgresql://localhost/RDBMS_Test_DB (aspire_user) (component='/My_RDB_Source/RDBMSRDB', componentFactory='aspire-multiple-rdb')
	at com.searchtechnologies.aspire.components.rdb.MultiRDBConnectionPool$ConnectionPool.createConnection(MultiRDBConnectionPool.java:535)
	at com.searchtechnologies.aspire.components.rdb.MultiRDBConnectionPool$ConnectionPool.getConnection(MultiRDBConnectionPool.java:477)
	at com.searchtechnologies.aspire.components.rdb.MultiRDBConnectionPool$ConnectionPool.access$000(MultiRDBConnectionPool.java:387)
	at com.searchtechnologies.aspire.components.rdb.MultiRDBConnectionPool.getConnection(MultiRDBConnectionPool.java:131)
	at com.searchtechnologies.aspire.components.rdbconnector.RDBScanner.doCrawl(RDBScanner.java:420)
	at com.searchtechnologies.aspire.scanner.AbstractPushScanner.performScan(AbstractPushScanner.java:30)
	at com.searchtechnologies.aspire.scanner.AbstractScanner.scanProcess(AbstractScanner.java:701)
	at com.searchtechnologies.aspire.scanner.AbstractScanner.process(AbstractScanner.java:323)
	at com.searchtechnologies.aspire.application.JobHandlerImpl.runNested(JobHandlerImpl.java:158)
	at com.searchtechnologies.aspire.application.JobHandlerImpl.run(JobHandlerImpl.java:80)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: org.postgresql.util.PSQLException: Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:207)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:64)
	at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:136)
	at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:29)
	at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:21)
	at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:31)
	at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:24)
	at org.postgresql.Driver.makeConnection(Driver.java:410)
	at org.postgresql.Driver.connect(Driver.java:280)
	at com.searchtechnologies.aspire.components.rdb.MultiRDBConnectionPool$ConnectionPool.createConnection(MultiRDBConnectionPool.java:525)
	... 12 more
Caused by: java.net.ConnectException: Connection refused: connect
	at java.net.DualStackPlainSocketImpl.connect0(Native Method)
	at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
	at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
	at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
	at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
	at java.net.PlainSocketImpl.connect(Unknown Source)
	at java.net.SocksSocketImpl.connect(Unknown Source)
	at java.net.Socket.connect(Unknown Source)
	at java.net.Socket.connect(Unknown Source)
	at org.postgresql.core.PGStream.<init>(PGStream.java:60)
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:101)
	... 21 more


SQL Errors

Errors in the SQL configured for the job will produce exceptions such as:

2014-02-10T19:54:07Z ERROR: Error executing select sql for full crawl (SELECT "id", "name", "company", "email",  birthday FROM Test_table WHERE "id" <= 1000 AND  MOD(id,10) = 9 )
org.postgresql.util.PSQLException: ERROR: relation "test_table" does not exist
  Position: 57
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2161)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1890)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:560)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:302)
	at com.searchtechnologies.aspire.components.rdbconnector.RDBScanner.executeSelectSQL(RDBScanner.java:57)
	at com.searchtechnologies.aspire.components.rdbconnector.RDBScanner.doCrawl(RDBScanner.java:466)
	at com.searchtechnologies.aspire.scanner.AbstractPushScanner.performScan(AbstractPushScanner.java:30)
	at com.searchtechnologies.aspire.scanner.AbstractScanner.scanProcess(AbstractScanner.java:701)
	at com.searchtechnologies.aspire.scanner.AbstractScanner.process(AbstractScanner.java:323)
	at com.searchtechnologies.aspire.application.JobHandlerImpl.runNested(JobHandlerImpl.java:158)
	at com.searchtechnologies.aspire.application.JobHandlerImpl.run(JobHandlerImpl.java:80)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
  • No labels