
The RDB connector via Snapshots will crawl content from any relational database that can be accessed using JDBC. The connector extracts data based on SQL statements and submits this data into Aspire for processing. The connector is different from many other connectors in that it directly extracts the data, so typically there is no fetch data phase. However, if your database includes references to external data (say, URLs to web sites or paths of external files), then a fetch stage may be invoked.


Introduction

RDB connector via Snapshots features include the following:

  • Connects to database server using JDBC drivers (these must be downloaded separately)
  • Performs full crawling
  • Performs incremental crawling, so that only new/updated documents are indexed, using snapshot files
  • Fetches data from the database using SQL statements
  • Is search engine independent

  • Runs from any machine with access to the given database

Environment and Access Requirements

Repository Support

JDBC Drivers

The RDB connector via Snapshots connects to databases via JDBC, so you'll need the appropriate JDBC client (driver) JAR file for the database you want to connect to. These are available for most (if not all) major database vendors, and your first port of call for the driver should be the vendor's website.

Account Privileges

A prerequisite for crawling any RDBMS is to have an RDBMS account. The recommended name for this account is "aspire_crawl_account" or something similar. The username and password for this account will be required below.

The "aspire_crawl_account" will need sufficient access rights to read all of the documents in the RDBMS that you wish to crawl.

To set the rights for your "aspire_crawl_account", do the following:

1. Log into the RDBMS as an administrator.
2. Make the role of the "aspire_crawl_account" either administrator or superuser (so that it has access to all RDBMS content).
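As an illustration, on a PostgreSQL-style database the steps above might look like the following. This is a sketch only: the account name, password, database, and schema are placeholders, and the exact syntax varies by vendor.

```sql
-- Illustrative only (PostgreSQL-style syntax); names and password are placeholders.
CREATE USER aspire_crawl_account WITH PASSWORD 'change_me';
GRANT CONNECT ON DATABASE mydb TO aspire_crawl_account;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO aspire_crawl_account;
```

Granting only SELECT on the tables to be crawled is usually sufficient if you prefer not to give the account administrator or superuser rights.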

You will need this login information later in these procedures, when entering properties for your RDB Connector via Snapshots.

Environment Requirements

No special requirements here.

Framework and Connector Features

Framework Features

Name                             Supported
Content Crawling                 Yes
Identity Crawling                No
Snapshot-based Incrementals      Yes
Non-snapshot-based Incrementals  No
Document Hierarchy               No

Connector Features

Retrieve Data per Batch

This mode takes the SQL from the seed configuration (<discoverySQL> and <extractSQL>) and executes it against the configured database. Each resulting row is formed into a result object using the column names as document elements, and this document is submitted to a pipeline manager using the event configured for inserts. As the document is created, the value of the column identified in the seed configuration (<idColumn>) is noted as the primary key of the document. The value "insert" will be placed in the action attribute of the document.

Column names from the extractSQL query are added to the result object inside the "connectorSpecific" field. If the column names are standard Aspire fields, they are added to the root level.
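The row-to-document mapping described above can be sketched in a few lines. This is not Aspire code, just an illustration of the behavior using an in-memory SQLite table; the table, columns, and queries are invented placeholders:

```python
# Illustrative sketch (not Aspire code): map a row returned by the extract SQL
# to a document, using column names as element names and the configured id
# column as the primary key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER, title TEXT, body TEXT)")
conn.execute("INSERT INTO docs VALUES (1, 'Hello', 'Example body')")

ID_COLUMN = "id"  # corresponds to <idColumn> in the seed configuration

cur = conn.execute("SELECT id, title, body FROM docs")  # stands in for <extractSQL>
columns = [d[0] for d in cur.description]
for row in cur:
    doc = dict(zip(columns, row))  # column names become document elements
    doc["action"] = "insert"       # the event configured for inserts
    primary_key = doc[ID_COLUMN]   # noted as the document's primary key
    print(primary_key, doc)
```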

Any change detected in the results of the query set in the discoverySQL field is compared with the snapshot file, and changes are reported as required.
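A seed configuration for this mode might look roughly like the sketch below. The element names <discoverySQL>, <extractSQL>, and <idColumn> come from the description above; the table and column names are invented placeholders, and the mechanism by which discovered ids are passed into the extract query is not shown:

```xml
<!-- Sketch only: table and column names are placeholders -->
<discoverySQL>SELECT id, last_modified FROM documents</discoverySQL>
<extractSQL>SELECT id, title, body FROM documents</extractSQL>
<idColumn>id</idColumn>
```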

Retrieve Everything

This mode takes the SQL from the seed configuration (<fullSQL>) and executes it against the configured database. Each resulting row is formed into a result object using the column names as document elements, and this document is submitted to a pipeline manager using the event configured for inserts. As the document is created, the value of the column identified in the seed configuration (<idColumn>) is noted as the primary key of the document. The value "insert" will be placed in the action attribute of the document.

Column names from SQL queries are added to the result object inside the "connectorSpecific" field. If the column names are standard Aspire fields, they are added to the root level.

Any change detected in the results of the query set in the fullSQL field is compared with the snapshot file, and changes are reported as required.
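A sketch of the relevant seed configuration elements for this mode, with <fullSQL> and <idColumn> taken from the description above and placeholder table and column names:

```xml
<!-- Sketch only: table and column names are placeholders -->
<fullSQL>SELECT id, title, body FROM documents</fullSQL>
<idColumn>id</idColumn>
```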

Content Crawled

The content retrieved by the connector is entirely defined using SQL statements, so you can select all or subsets of columns from one or more tables. Initially, the data is inserted into Aspire using the returned column names, but this may be changed by further Aspire processing.
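For example, a crawl could be driven by a statement like the following, selecting a subset of columns across two joined tables. The schema is hypothetical:

```sql
-- Hypothetical schema: pull selected columns from two joined tables.
SELECT d.id, d.title, a.name AS author
FROM documents d
JOIN authors a ON a.id = d.author_id;
```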

The RDB via Snapshots connector is able to crawl the following objects:

Name          Type  Relevant Metadata  Content Fetch and Extraction  Description
database row        table fields       NA                            Fields requested by SQL


Limitations

No limitations defined.