The Database Server Connector can be configured using the Aspire Admin UI. It requires the following entities to be created:

  • Credential
  • Connection
  • Connector
  • Seed


Create Credential 


  1. On the Aspire Admin UI, go to the Credentials page.
  2. All existing credentials will be listed. Click the New button.
  3. Enter the new credential description.
  4. Select Database Server from the Type list.
  5. General: In Credential type, select between Basic Authentication and Kerberos Authentication.
    1. Basic Authentication: Enter your username and password.
    2. Kerberos Authentication: Enter the Username, the Keytab File (path to the keytab file), the External Jars Path (path to the folder containing the other required files), and the Hadoop Resource Files (path to the Hadoop configuration files).
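
For illustration, the Kerberos fields might be filled in as in the sketch below; the principal, realm, and paths are placeholders for your environment, not values the connector prescribes:

Code Block
Username:              crawler@EXAMPLE.COM
Keytab File:           /etc/security/keytabs/crawler.keytab
External Jars Path:    /opt/aspire/extra-jars
Hadoop Resource Files: /etc/hadoop/conf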


Create Connection 


  1. On the Aspire Admin UI, go to the Connections page.
  2. All existing connections will be listed. Click the New button.
  3. Enter the new connection description. 
  4. Select Database Server from the Type list.
  5. General:
    1. Server URL: The JDBC connection string of the server (see the example at the end of this section).
    2. JDBC driver: The path to the driver (.jar).
    3. Specify JDBC Driver Class: Select this option if the driver class name is non-standard.
    4. Specify Classpath: Select this option if additional driver libraries are needed, and add the path of the folder that contains them.
  6. Scan Options:
    1. Stop scan on error: The scan will be stopped as soon as an error occurs during scanning.
    2. Prefetch size: The number of items to be loaded in memory at a time.
    3. Index DBs and tables: Indexes the metadata from the databases and tables. The following sub-options are available:
      1. Extract table row count: Includes the number of rows of each table.
      2. Add tables schema: Includes the table structure.
      3. Use query for table metadata extraction: A custom SQL query used to extract additional metadata for each table; the {{table}} and {{database}} placeholders stand for the current table and database. For example:

        Code Block
        select data from admin_table where table_id={{table}} and database_id={{database}}
      4. Add resultSet to table job: Adds the content of the tables to the job.
    4. Enable row extraction: Indexes all the rows in a table.
      1. Limit extracted rows: Restricts how many rows are extracted. Limit (the number of rows to extract); Perform Sampling (randomize which rows are extracted).
Info

Please keep in mind that you must select either Index DBs and tables or Enable row extraction for the connector to work properly. The two options are mutually exclusive, so they cannot be used at the same time.
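
As a point of reference, a connection to a PostgreSQL server could use values like the sketch below; the host, port, database name, and driver path are placeholders, and other database engines use their own JDBC URL formats and driver classes:

Code Block
Server URL:        jdbc:postgresql://dbserver.example.com:5432/sales
JDBC driver:       /opt/aspire/drivers/postgresql-42.7.3.jar
JDBC Driver Class: org.postgresql.Driver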


Create Connector Instance


To create the Connector object using the Admin UI, check this page.


Create Seed 


  1. On the Aspire Admin UI, go to the Seeds page.
  2. All existing seeds will be listed. Click the New button.
  3. Enter the new seed description.
  4. Select Database Server from the Type list.
  5. Scope: Exclude/Include File: Can be used to filter the crawled databases, schemas, and tables. Add the path to the .json file where the excluded/included items are defined. The file supports the following filter lists:

    1. database: List of databases to exclude/include from the crawl.
      1. name: The name or pattern of the database to exclude/include.
      2. pattern: Specifies whether the name parameter is a regex pattern.
    2. table: List of tables to exclude/include from the crawl.
      1. name: The name or pattern of the table to exclude/include.
      2. database: Optional - the name of the database that contains the table to exclude/include. If specified, only the tables in that database will be filtered.
      3. pattern: Specifies whether the name parameter is a regex pattern.
    3. schema: List of schemas to exclude/include from the crawl.
      1. name: The name or pattern of the schema to exclude/include.
      2. database: Optional - the name of the database that contains the schema to exclude/include. If specified, only the schemas in that database will be filtered.
      3. pattern: Specifies whether the name parameter is a regex pattern.
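    For example, a scope file combining all three filter types might look like this: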
    Code Block
    {
        "database" : [
            {
                "name" : "dbtest1",
                "pattern" : false
            },{
                "name" : ".*2",
                "pattern" : true
            }
        ],
        "table" : [
            {
                "name" : "test1",
                "database" : "dbtest1",
                "pattern" : false
            }, {
                "name" : ".*3",
                "pattern" : true
            }, {
                "name" : ".*4",
                "database" : "dbtest2",
                "pattern" : true
            }
        ],
        "schema" : [
            {
                "name" : "*schema_1*",
                "database" : "dbtest_1",
                "pattern" : false
            }
        ]
    }
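    In this sample, dbtest1 is matched by its literal name and any database whose name ends in 2 is matched by pattern; table test1 is filtered only inside dbtest1, tables ending in 3 are filtered in any database, and tables ending in 4 only inside dbtest2; the schema entry matches the literal name *schema_1* inside dbtest_1, since its pattern flag is false.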