
The Aspider Web Crawler can be configured using the Aspire Admin UI. It requires the following entities to be created:

  • Credential
  • Connection
  • Connector
  • Seed

Create Credential 


  1. On the Aspire Admin UI, go to the Credentials page.
  2. All existing credentials will be listed. Click the New button.
  3. Enter a description for the new credential.
  4. Select Database Server from the Type list.
  5. General: In Credential type, select either Basic Authentication or Kerberos Authentication.
    1. Basic authentication: Enter your username and password.
    2. Kerberos authentication: Enter the username, Keytab File (path to the keytab file), External Jars Path (path to the folder where other needed files are located), and Hadoop Resource Files (path to the Hadoop resource files).


Create Connection 


  1. On the Aspire Admin UI, go to the Connections page.
  2. All existing connections will be listed. Click the New button.
  3. Enter a description for the new connection.
  4. Select Database Server from the Type list.
  5. General:
    1. Server URL: The JDBC connection string of the server.
    2. JDBC driver: The path to the JDBC driver (.jar) file.
    3. Specify JDBC Driver Class: Select this option if the driver class name is non-standard.
    4. Specify Classpath: Select this option if additional drivers need to be uploaded, and add the path of the folder that contains the additional drivers.
  6. Scan Options:
    1. Stop scan on error: The scan will be stopped as soon as an error occurs during scanning.
    2. Prefetch size: The number of items to load into memory at a time.
    3. Index DBs and tables: Index the metadata from the databases and tables. Choose between:
      1. Extract table row count: Include the number of rows of the table.
      2. Add tables schema: Include the table structure.
      3. Use query for table metadata extraction: A specific query to extract additional data, e.g.:

        select data from admin_table where table_id={{table}} and database_id={{database}}
      4. Add resultSet to table job: Adds the content of the tables to the job.
    4. Enable row extraction: Index all the rows in a table.
      1. Limited extracted rows: Specify the number of rows to be extracted. Limit (number of rows to extract); Perform Sampling (randomize which rows are extracted).
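The metadata-extraction query above uses {{table}} and {{database}} placeholders, which are filled in at crawl time with the identifiers of the item being processed. A minimal sketch of that substitution, assuming simple literal replacement (Aspire's actual templating mechanism may differ, and the helper name here is hypothetical):

```java
public class QueryTemplate {
    // Hypothetical helper: replaces the {{table}} and {{database}}
    // placeholders with the identifiers of the item being crawled.
    static String fill(String template, String table, String database) {
        return template.replace("{{table}}", table)
                       .replace("{{database}}", database);
    }

    public static void main(String[] args) {
        String template =
            "select data from admin_table where table_id={{table}} and database_id={{database}}";
        System.out.println(fill(template, "42", "7"));
        // -> select data from admin_table where table_id=42 and database_id=7
    }
}
```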

Please keep in mind that you must select either Index DBs and tables or Enable row extraction for the connector to work properly. These two options are mutually exclusive, so you cannot use both at the same time.
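The row-extraction options (Limit and Perform Sampling) can be pictured as follows. This is only an illustrative sketch; the real connector operates on database result sets, and the names used here are assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RowExtraction {
    // Take at most `limit` rows; with sampling enabled, pick a random
    // subset instead of the first rows returned.
    static List<String> extract(List<String> rows, int limit, boolean performSampling) {
        List<String> copy = new ArrayList<>(rows);
        if (performSampling) {
            Collections.shuffle(copy, new Random());
        }
        return copy.subList(0, Math.min(limit, copy.size()));
    }

    public static void main(String[] args) {
        List<String> rows = List.of("row1", "row2", "row3", "row4", "row5");
        System.out.println(extract(rows, 3, false)); // [row1, row2, row3]
        System.out.println(extract(rows, 3, true).size()); // 3 randomly chosen rows
    }
}
```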


Create Connector Instance


To create the Connector object using the Admin UI, check this page.


Create Seed 


  1. On the Aspire Admin UI, go to the Seeds page.
  2. All existing seeds will be listed. Click the New button.
  3. Enter a description for the new seed.
  4. Select Database Server from the Type list.
  5. Scope: Exclude/Include File: Can be used to filter the crawled databases, schemas, and tables. Add the path to the .json file where the excluded/included items are placed.

    1. database: List of databases to exclude/include from the crawl.
      1. name: Name or pattern of the database to exclude/include.
      2. pattern: Specifies whether the name parameter is a regex pattern.
    2. table: List of tables to exclude/include from the crawl.
      1. name: Name or pattern of the table to exclude/include.
      2. database: Optional - name of the database that contains the table to exclude/include. If specified, only the tables contained in that database will be filtered.
      3. pattern: Specifies whether the name parameter is a regex pattern.
    3. schema: List of schemas to exclude/include from the crawl.
      1. name: Name or pattern of the schema to exclude/include.
      2. database: Optional - name of the database that contains the schema to exclude/include. If specified, only the schemas contained in that database will be filtered.
      3. pattern: Specifies whether the name parameter is a regex pattern.

    Example exclude/include file:
    {
        "database" : [
            {
                "name" : "dbtest1",
                "pattern" : false
            },{
                "name" : ".*2",
                "pattern" : true
            }
        ],
        "table" : [
            {
                "name" : "test1",
                "database" : "dbtest1",
                "pattern" : false
            }, {
                "name" : ".*3",
                "pattern" : true
            }, {
                "name" : ".*4",
                "database" : "dbtest2",
                "pattern" : true
            }
        ],
        "schema" : [
            {
                "name" : "*schema_1*",
                "database" : "dbtest_1",
                "pattern" : false
            }
        ]
    }
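The pattern flag controls how a name entry is matched: a plain name must match exactly, while a pattern is evaluated as a regular expression. A minimal sketch of that rule (illustrative only; Aspire's actual filter evaluation may differ):

```java
public class ScopeFilter {
    // pattern == true  -> treat filterName as a regular expression (full match)
    // pattern == false -> require an exact name match
    static boolean matches(String name, String filterName, boolean pattern) {
        return pattern ? name.matches(filterName) : name.equals(filterName);
    }

    public static void main(String[] args) {
        System.out.println(matches("dbtest1", "dbtest1", false)); // true  (exact match)
        System.out.println(matches("dbtest2", ".*2", true));      // true  (regex match)
        System.out.println(matches("dbtest3", "dbtest1", false)); // false
    }
}
```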

