The RDB via Snapshots connector will crawl content from any relational database that can be accessed using JDBC. The connector
will extract data based on SQL statements and
submit this data
to Aspire for processing
. The connector is different from many other connectors in that it extracts the data directly, so typically there is no fetch data phase. However, if your database includes references to external data (say, URLs to web sites or paths of external files), then a fetch
stage may be invoked.
The RDB via Snapshots connector features include the following:
The content retrieved by the connector is defined entirely using SQL statements, so you can select all or subsets of columns from one or more tables. Initially, the data is inserted into Aspire using the returned column names, but this may be changed by further Aspire processing.
The connector can operate in two modes: full and incremental.
Important: The data submitted to Aspire by this connector is dependent entirely on the SQL that's configured. Therefore, it is quite possible to submit all of the data in an incremental crawl, or only some of the data in a full crawl.
In full mode, the connector executes a single SQL database statement and submits each row returned for processing in Aspire.
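As an illustration, the full-mode statement is simply whatever SELECT you configure; each returned row becomes one document. The table and column names below are hypothetical, not part of the connector:

```sql
-- Hypothetical full-crawl statement: every row returned becomes one Aspire document.
-- "id" would be the column configured as the primary key; the remaining columns
-- become document elements named after the result-set columns.
SELECT id, title, author, last_modified
FROM   documents;
```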
In incremental mode, there are three stages of processing: preprocessing, crawling, and post-processing.
Pre-processing (optional): This stage runs a SQL statement against the database that can be used to mark rows to crawl (i.e., rows that have changed since the previous run).
Crawling: This stage (similar to full mode) executes a single SQL database statement and submits each row returned for processing in Aspire. Typically, the result set is a subset of the full data, filtered using information updated in the (optional) pre-processing stage.
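As a sketch of how the pre-processing and crawl stages can cooperate (all table, column, and flag names here are hypothetical, chosen only for illustration):

```sql
-- Pre-processing (optional): mark the rows that changed since the previous run.
UPDATE documents
SET    crawl_flag = 'Y'
WHERE  last_modified > (SELECT last_run FROM crawl_state);

-- Crawling: submit only the marked subset of rows to Aspire.
SELECT id, title, author, last_modified
FROM   documents
WHERE  crawl_flag = 'Y';
```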
This mode uses the SQL taken from the seed configuration (<discoverySQL>, <extractSQL>) and executes it against the configured database. Each resulting row is formed into a result object using the column names as document elements, and this document is submitted to a pipeline manager using the event configured for inserts. As the document is created, the value of the column identified in the seed configuration (<idColumn>) is noted as the primary key of the document. The value "insert" will be placed in the action attribute of the document.
Column names from the extractSQL query are added to the result object inside the "connectorSpecific" field. If the column names are standard Aspire fields, they are added to the root level.
Any change detected in the query set in the <discoverySQL> field will be compared with the snapshot file, and the change will be reported if required.
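A seed for incremental mode could look something like the sketch below. Only <discoverySQL>, <extractSQL>, and <idColumn> are taken from the description above; the surrounding element structure, the table and column names, and the {id} substitution placeholder are assumptions for illustration and may differ in your Aspire version:

```xml
<!-- Hypothetical incremental seed fragment (element nesting, names of tables
     and columns, and the {id} placeholder are illustrative assumptions). -->
<seed>
  <idColumn>id</idColumn>
  <discoverySQL>SELECT id, last_modified FROM documents</discoverySQL>
  <extractSQL>SELECT id, title, author FROM documents WHERE id = {id}</extractSQL>
</seed>
```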
This mode uses the SQL taken from the seed configuration (<fullSQL>) and executes it against the configured database. Each resulting row is formed into a result object using the column names as document elements, and this document is submitted to a pipeline manager using the event configured for inserts. As the document is created, the value of the column identified in the seed configuration (<idColumn>) is noted as the primary key of the document. The value "insert" will be placed in the action attribute of the document.
Column names from SQL queries are added to the result object inside the "connectorSpecific" field. If the column names are standard Aspire fields, they are added to the root level.
Any change detected in the query set in the <fullSQL> field will be compared with the snapshot file, and the change will be reported if required.
Post-processing (optional): Each row of data submitted to Aspire can execute a SQL statement to update its status in the database. This may be to reset a flag set in the (optional) pre-processing stage, thereby marking the item as processed. Different SQL can be executed for rows that were successfully processed versus ones that were not.
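The post-processing statements might look like the following sketch. The table, column, and flag names are hypothetical, and the {id} substitution placeholder is an assumption about how the row's primary key would be injected:

```sql
-- Post-processing (optional), hypothetical names throughout:
-- clear the flag for rows that were processed successfully...
UPDATE documents SET crawl_flag = 'N' WHERE id = {id};

-- ...and run different SQL for rows that failed, so they can be retried.
UPDATE documents SET crawl_flag = 'E' WHERE id = {id};
```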
The RDB via Snapshots connector connects to databases via JDBC, so you'll need the appropriate JDBC client (driver) JAR file for the database you want to connect to. These are available for most (if not all) major database vendors; your first port of call for the driver should be the vendor's website.
A prerequisite for crawling any RDBMS is to have an RDBMS account. The recommended name for this account is "aspire_crawl_account" or something similar. The username and password for this account will be required below.
The "aspire_crawl_account" will need to have sufficient access rights to read all of the documents in the RDBMS that you wish to crawl.
To set the rights for your "aspire_crawl_account", do the following:
You will need this login information later in these procedures, when entering properties for your RDB via Snapshots connector.
There are no special requirements here.
Name | Supported |
---|---|
Content Crawling | yes |
Identity Crawling | no |
Snapshot-based Incrementals | no |
Non-snapshot-based Incrementals | yes |
Document Hierarchy | no |
Click here to find out about the various crawling options.
The RDB via Snapshots connector is able to crawl the following objects:
Name | Type | Relevant Metadata | Content Fetch & Extraction | Description |
---|---|---|---|---|
database row | table fields | NA | Fields requested by SQL | |
No limitations defined