The Database Server Connector can be configured using the Aspire Admin UI. It requires the following entities to be created:

  • Credential
  • Connection
  • Connector
  • Seed


Create Credential 


  1. On the Aspire Admin UI, go to the Credentials page.
  2. All existing credentials will be listed. Click the New button.
  3. Enter the new credential description.
  4. Select Database Server from the Type list.
  5. General: In Credential type, select between Basic Authentication and Kerberos Authentication.
    1. Basic Authentication: Enter your username and password.
    2. Kerberos Authentication: Enter the Username, the Keytab File (path to the keytab file), the External Jars Path (path to the folder containing the other required files), and the Hadoop Resource Files (path to the Hadoop configuration files).
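
For illustration, the Kerberos fields might be filled in as in the sketch below; the principal, realm, and paths are placeholders for your environment, not values the connector prescribes:

Code Block
Username:              crawler@EXAMPLE.COM
Keytab File:           /etc/security/keytabs/crawler.keytab
External Jars Path:    /opt/aspire/extra-jars
Hadoop Resource Files: /etc/hadoop/conf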


Create Connection 


  1. On the Aspire Admin UI, go to the Connections page.
  2. All existing connections will be listed. Click the New button.
  3. Enter the new connection description. 
  4. Select Database Server from the Type list.
  5. General:
    1. Server URL: The JDBC connection string of the server (see the example at the end of this section).
    2. JDBC driver: The path to the driver (.jar).
    3. Specify JDBC Driver Class: Select this option if the driver class name is non-standard.
    4. Specify Classpath: Select this option if additional driver libraries are needed, and add the path of the folder that contains them.
  6. Scan Options:
    1. Stop scan on error: The scan will be stopped as soon as an error occurs during scanning.
    2. Prefetch size: The number of items to be loaded in memory at a time.
    3. Index DBs and tables: Indexes the metadata from the databases and tables. The following sub-options are available:
      1. Extract table row count: Includes the number of rows of each table.
      2. Add tables schema: Includes the table structure.
      3. Use query for table metadata extraction: A custom SQL query used to extract additional metadata for each table; the {{table}} and {{database}} placeholders stand for the current table and database. For example:

        Code Block
        select data from admin_table where table_id={{table}} and database_id={{database}}
      4. Add resultSet to table job: Adds the content of the tables to the job.
    4. Enable row extraction: Indexes all the rows in a table.
      1. Limit extracted rows: Restricts how many rows are extracted. Limit (the number of rows to extract); Perform Sampling (randomize which rows are extracted).
Info

Please keep in mind that you must select either Index DBs and tables or Enable row extraction for the connector to work properly. The two options are mutually exclusive, so they cannot be used at the same time.
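
As a point of reference, a connection to a PostgreSQL server could use values like the sketch below; the host, port, database name, and driver path are placeholders, and other database engines use their own JDBC URL formats and driver classes:

Code Block
Server URL:        jdbc:postgresql://dbserver.example.com:5432/sales
JDBC driver:       /opt/aspire/drivers/postgresql-42.7.3.jar
JDBC Driver Class: org.postgresql.Driver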


Create Connector Instance


To create the Connector object using the Admin UI, check this page.


Create Seed 


  1. On the Aspire Admin UI, go to the Seeds page.
  2. All existing seeds will be listed. Click the New button.
  3. Enter the new seed description.
  4. Select Database Server from the Type list.
  5. Scope: Exclude/Include File: Can be used to filter the crawled databases, schemas, and tables. Add the path to the .json file where the excluded/included items are defined. The file supports the following filter lists:

    1. database: List of databases to exclude/include from the crawl.
      1. name: The name or pattern of the database to exclude/include.
      2. pattern: Specifies whether the name parameter is a regex pattern.
    2. table: List of tables to exclude/include from the crawl.
      1. name: The name or pattern of the table to exclude/include.
      2. database: Optional - the name of the database that contains the table to exclude/include. If specified, only the tables in that database will be filtered.
      3. pattern: Specifies whether the name parameter is a regex pattern.
    3. schema: List of schemas to exclude/include from the crawl.
      1. name: The name or pattern of the schema to exclude/include.
      2. database: Optional - the name of the database that contains the schema to exclude/include. If specified, only the schemas in that database will be filtered.
      3. pattern: Specifies whether the name parameter is a regex pattern.
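    For example, a scope file combining all three filter types might look like this: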
    Code Block
    {
        "database" : [
            {
                "name" : "dbtest1",
                "pattern" : false
            },{
                "name" : ".*2",
                "pattern" : true
            }
        ],
        "table" : [
            {
                "name" : "test1",
                "database" : "dbtest1",
                "pattern" : false
            }, {
                "name" : ".*3",
                "pattern" : true
            }, {
                "name" : ".*4",
                "database" : "dbtest2",
                "pattern" : true
            }
        ],
        "schema" : [
            {
                "name" : "*schema_1*",
                "database" : "dbtest_1",
                "pattern" : false
            }
        ]
    }
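    In this sample, dbtest1 is matched by its literal name and any database whose name ends in 2 is matched by pattern; table test1 is filtered only inside dbtest1, tables ending in 3 are filtered in any database, and tables ending in 4 only inside dbtest2; the schema entry matches the literal name *schema_1* inside dbtest_1, since its pattern flag is false.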