Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
type | Yes | - | No | The value must be "selenium". | "selenium" |
description | Yes | - | No | Name of the credential object. | "My Selenium Credential" |
properties | Yes | - | No | Configuration object | |
authenticationHandler | Yes | - | Yes | Authentication handlers | [] |
host | No | - | No | Hostname of the URLs to which this handler applies. If no hostname is set, the handler applies to all URLs. | "domain.com" |
port | No | -1 | No | Port of the URL. If set to -1, any port is accepted. | 8080 |
loginUrl | Yes | - | No | URL of the login page. | "http://yoursite/login.php" |
user | Yes | - | No | User name. | "admin" |
password | Yes | - | No | User password. | "password" |
authenticationType | Yes | - | No | Authentication implementation. The crawler uses this configuration to log in to the page (SIMPLE/SCRIPT). | "SIMPLE" |
Simple Authentication | |||||
userSelectorType | Yes | Id | No | Username field selector type (Class, Css, Id, Name, XPath). | "Id" |
userSelector | Yes | - | No | Username field: field on the login form where the username should be entered. | "txtUsername" |
passwordSelectorType | Yes | Id | No | Password field selector type (Class, Css, Id, Name, XPath). | "Id" |
passwordSelector | Yes | - | No | Password field: field on the login form where the password should be entered. | "txtPassword" |
customField | No | - | Yes | Custom fields to fill in on the login form. | [] |
selectorType | Yes | Id | No | Type of selector used to locate the field within the page (Class, Css, Id, Name, XPath). | "Id" |
selector | Yes | - | No | Value of the selector used to locate the field within the page. | "myField" |
fieldType | No | Text | No | Type of field to locate within the page (Button, Checkbox, Select, RadioButton, Text). | "Text" |
fieldValue | Yes | - | No | Value to set on the field. | "myFieldValue" |
submitSelectorType | Yes | Id | No | Type of selector used to locate the submit button within the page (Class, Css, Id, Name, XPath). | "Id" |
submitSelector | Yes | - | No | Value of the selector used to locate the submit button within the page. | "btnSubmit" |
Scripted authentication | |||||
authenticationScript | Yes | - | No | Groovy code with instructions to fill in the login form (see code block below). | |
```groovy
// Write the instructions to fill the login form
//
// Available variables:
//
//  - seedId, String. Id of the current seed.
//  - logger, ALogger. Logger instance.
//  - driver, WebDriver. Selenium WebDriver instance for controlling a browser.
//  - loginUrl, String. Login URL.
//  - username, String. Username.
//  - password, String. Password (decrypted). IMPORTANT: DO NOT LOG THIS VALUE
```
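As an illustration only (not part of the connector's shipped code), a minimal SCRIPT-type login could fill the form through the provided `driver` variable. The element ids used here are hypothetical; replace them with the selectors of your own login page:

```groovy
import org.openqa.selenium.By;

// Navigate to the login page; 'driver', 'loginUrl', 'username'
// and 'password' are supplied by the connector.
driver.get(loginUrl);

driver.findElement(By.id("txtUsername")).sendKeys(username);  // hypothetical field id
driver.findElement(By.id("txtPassword")).sendKeys(password);  // never log this value
driver.findElement(By.id("btnSubmit")).click();               // hypothetical submit button
```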
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
verificationType | Yes | - | No | Verification style (SIMPLE, SCRIPT). | "SIMPLE" |
Simple verification | |||||
verificationField | Yes | - | Yes | Fields to verify. | [] |
fieldSelectorType | Yes | Id | No | Type of selector used to locate the field within the page (Class, Css, Id, Name, XPath). | "Id" |
fieldSelector | Yes | - | No | Value of the selector used to locate the field within the page. | "myField" |
Scripted verification | |||||
verificationScript | Yes | - | No | Groovy code with instructions to validate whether the login was successful. Must return a boolean (see code block below). | |
```groovy
// Check if the session is still valid.
// Must return a boolean.
// If the session is invalid, a login shall be attempted.
//
// Available variables:
//
//  - seedId, String. Id of the current seed.
//  - logger, ALogger. Logger instance.
//  - driver, WebDriver. Selenium WebDriver instance for controlling a browser.
//  - loginUrl, String. Login URL.

return false;
```
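As a hedged sketch of a SCRIPT-type verification: this assumes the site renders an element with a (hypothetical) id of "logoutLink" only while a session is active; adapt the locator to your own page:

```groovy
import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;

// Return true when the session is still valid, false to trigger a new login.
try {
    driver.findElement(By.id("logoutLink"));  // hypothetical element, present only when logged in
    return true;
} catch (NoSuchElementException e) {
    return false;
}
```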
```json
{
  "type": "selenium",
  "description": "Selenium Desc",
  "properties": {
    "authenticationHandler": [
      {
        "host": "domain.com",
        "port": 80,
        "loginUrl": "http://yoursite/login.php",
        "user": "user",
        "password": "password",
        "authenticationType": "SIMPLE",
        "userSelectorType": "Id",
        "userSelector": "txtUsername",
        "passwordSelectorType": "Id",
        "passwordSelector": "txtPassword",
        "customField": [],
        "submitSelectorType": "Id",
        "submitSelector": "btnSubmit",
        "verificationType": "SIMPLE",
        "verificationField": [
          {
            "fieldSelectorType": "Id",
            "fieldSelector": "myField"
          }
        ]
      }
    ]
  }
}
```
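The example above leaves "customField" empty. For illustration only, a populated entry might look like the fragment below; the selector and value are hypothetical and should match a real control on your login form:

```json
{
  "customField": [
    {
      "selectorType": "Id",
      "selector": "chkRememberMe",
      "fieldType": "Checkbox",
      "fieldValue": "true"
    }
  ]
}
```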
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | Id of the credential to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
description | Yes | - | No | Name of the credential object. | "Selenium Credential" |
properties | Yes | - | No | Configuration object | |
(see create credential) |
```json
{
  "id": "2a5ca234-e328-4d40-bb2a-2df3e550b065",
  "description": "SeleniumCredential",
  "properties": {
    "authenticationHandler": [
      {
        "host": "domain.com",
        "port": 80,
        "loginUrl": "http://yoursite/login.php",
        "user": "user",
        "password": "password",
        "authenticationType": "SIMPLE",
        "userSelectorType": "Id",
        "userSelector": "txtUsername",
        "passwordSelectorType": "Id",
        "passwordSelector": "txtPassword",
        "customField": [],
        "submitSelectorType": "Id",
        "submitSelector": "btnSubmit",
        "verificationType": "SIMPLE",
        "verificationField": [
          {
            "fieldSelectorType": "Id",
            "fieldSelector": "myField"
          }
        ]
      }
    ]
  }
}
```
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
type | Yes | - | No | The value must be "selenium". | "selenium" |
description | Yes | - | No | Name of the connection object. | "My Selenium Connection" |
throttlePolicy | No | - | No | Id of the throttle policy that applies to this connection object. | "6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The ids of the routing policies that this connection will use. | ["17f75ce7d0c7", "d42780003b36"] |
credential | Yes | - | No | Id of the credential | "6b235b333a1b" |
properties | Yes | - | No | Configuration object | |
Web driver | |||||
webDriverImplementation | Yes | CHROME | No | Web driver implementation to use; this determines the browser that Selenium will control (CHROME, FIREFOX). | "CHROME" |
webDriverPath | Yes | - | No | Path to the web driver. Once a driver implementation is selected, a path to the executable driver must be provided. | "/driver/chromedriver.exe" |
headlessMode | No | true | No | If enabled, the browser runs in headless mode and no UI window is displayed. | true |
Scope | |||||
crawlScope | Yes | HOST | No | Determines the scope of the crawl (HOST, EVERYTHING, CUSTOM). EVERYTHING allows all URLs to be crawled. HOST allows only URLs on the same host as the seed; for example, if your seed contains 'abc.domain.com', only URLs on 'abc.domain.com' are allowed. CUSTOM lets you add one or more regular expression patterns to match the host name. These scopes can be extended by the include patterns. | "HOST" |
scopePattern | Yes, for crawl scope "CUSTOM" | - | Yes | Custom scope patterns. Any URL matching one of these patterns is included in the scope. Patterns are evaluated against the document URL. | [".*\\.example.com", ".*\\.another.com"] |
obeyRobots | No | true | No | If checked, the crawler obeys the robots.txt restrictions of each site. | true |
caseSensitiveUrls | No | true | No | If unchecked, all URLs are transformed to lower case before processing. | true |
maxHops | Yes | 5 | No | Crawl depth: the number of hops from the seed the crawler is allowed to follow. | 5 |
includes | No | - | Yes | The document is processed by the connector only if it matches one of these patterns. Patterns are evaluated against the document URL. | ".*\\.pdf" |
excludes | No | - | Yes | The document is not processed by the connector if it matches one of these patterns. Patterns are evaluated against the document URL. | ".*\\.xml" |
Document processing | |||||
cleanupRule | No | - | Yes (see fields below) | Content cleanup rules. The specified behavior applies to URLs that match the given patterns. | |
pattern | Yes | - | No | The URL is matched against this pattern to check whether it should be cleaned up. | ".*\\.xml$" |
contentType | Yes | - | No | Regular expression evaluated against the document MIME type to check whether the document should be cleaned up. | "text/html\\.*" |
noIndexClass | No | - | No | Comma-separated list of CSS classes whose elements will be removed from the page content. | "noindex, nofollow" |
cleanupPattern | No | - | No | Regular expression to remove matching text from the page content. | "<!-- noindex -->.*<!-- /noindex -->" |
pageSettings | No | - | Yes (see fields below) | Page settings. The crawler applies the following behavior to URLs that match the patterns. | |
urlPattern | Yes | - | No | The URL is matched against this pattern to check whether these settings apply. | ".*\\.xml$" |
cooldown | Yes | 1s | No | Time to wait for the page to finish loading before further processing. | "1s" |
useLinkExtractionScript | No | false | No | Override link extraction logic. Check this field to override the way the crawler extracts links from a page. | false |
linkExtractionScript | No | - | No | Link extraction script: instructions to extract the links of a page (see code block below). | |
```groovy
/* Add the discovered URLs to the variable 'discoveredUrls'
 *
 * Available variables:
 *  - seedId: String, Id of the current seed
 *  - logger: ALogger, logger implementation
 *  - discoveredUrls: List<String>, List with all the urls discovered in the current page
 */
import com.accenture.aspire.framework.utilities.StringUtilities;
import org.openqa.selenium.WebDriverException;

def pageLinks = [];

try {
    pageLinks = driver.findElements(By.tagName("a"));
} catch (WebDriverException wde) {
    /* Do Nothing */
}

try {
    pageLinks = pageLinks + driver.findElements(By.xpath("//*[@src]"));
} catch (WebDriverException wde) {
    /* Do Nothing */
}

pageLinks.each { url ->
    String link = url.getAttribute("href");

    if (StringUtilities.isEmpty(link))
        link = url.getAttribute("src");

    discoveredUrls.add(link);
}
```
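The cleanupRule fields described above leave the two optional values empty in the full example further down. For illustration only, an entry with those fields populated might look like this; the patterns are examples, not defaults:

```json
{
  "cleanupRule": [
    {
      "pattern": ".*\\.html$",
      "contentType": "text/html\\.*",
      "noIndexClass": "noindex, nofollow",
      "cleanupPattern": "<!-- noindex -->.*<!-- /noindex -->"
    }
  ]
}
```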
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
Crawler | |||||
maxContentSize | Yes | 10mb | No | Maximum content size allowed to be fetched for a page. | 15mb |
showNon200AsErrors | No | true | No | Show 400 and 500 status codes as errors. Uncheck if you want those URLs marked as "excluded" instead of "errored". | true |
stopOnScanError | No | true | No | Stop on scan error. If unchecked, scan errors do not stop the crawl. | true |
logCrawledUrls | No | true | No | Log crawled URLs. If checked, a log with all the crawled URLs is created. | true |
debugContentOutput | No | false | No | Write contents to file. If checked, the crawler writes every page to the local file system, in the folder "data/CONTENT-SOURCE-NAME/output". | true |
incrementalUrlCleanupRegex | No | - | No | URL cleanup for incremental crawls. Regex for cleaning up the URL in case of dynamically generated parameters. This prevents incremental crawls from treating URLs as different documents when the only difference is a dynamic parameter. For example, http://myhost/my-page.html?mydynamic=123456 is transformed to http://myhost/my-page.html for incremental purposes, but the original URL is still used for fetching. | "\\?.*" |
excludeMultimedia | No | true | No | Reject images / videos / JavaScript / CSS. If checked, js, css, swf, gif, png, jpg, jpeg, bmp, mp3, mp4, avi, mpg and mpeg files are excluded from the crawl. | false |
Network | |||||
connectionTimeout | Yes | 10s | No | Timeout used when a connection is established. | 20s |
connectionRequestTimeout | Yes | 10s | No | Timeout used when requesting a connection from the connection manager. | 15s |
socketTimeout | Yes | 10s | No | Timeout used while waiting for data (maximum period of inactivity between two consecutive data packets). | 10s |
useProxy | No (if true, see fields below) | false | No | Use proxy. If checked, the crawler connects through a proxy. | true |
proxyHost | Yes | - | No | Proxy hostname. | "your-proxy.domain.com" |
proxyPort | Yes | 8080 | No | Proxy port. | 8080 |
proxyAuthentication | No (if true, see fields below) | false | No | Use proxy authentication. If checked, the crawler authenticates against the proxy. | true |
proxyUser | Yes | - | No | Proxy username. | "user" |
proxyPassword | Yes | - | No | Proxy password. | "password" |
Security | |||||
staticAcl | No | - | Yes (see fields below) | Static ACLs. These ACLs are added to all documents. | |
name | Yes | - | No | Name of the ACL. | "john.doe" |
domain | No | - | No | Domain to which the ACL belongs. | "domain" |
entity | No | user | No | Whether this ACL is for a user or a group (user/group). | "group" |
access | No | allow | No | Whether this ACL allows or denies access to crawled files (allow/deny). | "deny" |
```json
{
  "type": "selenium",
  "description": "Selenium",
  "properties": {
    "webDriverImplementation": "CHROME",
    "webDriverPath": "/tmp/ach1/driver/chromedriver.exe",
    "headlessMode": true,
    "crawlScope": "CUSTOM",
    "scopePattern": [ ".*\\.example.com", ".*\\.another.com" ],
    "obeyRobots": true,
    "caseSensitiveUrls": true,
    "maxHops": 5,
    "includes": ".*\\.pdf",
    "excludes": ".*\\.xml",
    "pageSettings": [
      {
        "urlPattern": ".*\\.xml$",
        "cooldown": "1s",
        "useLinkExtractionScript": true,
        "linkExtractionScript": "/* Add the discovered URLs to the variable 'discoveredUrls'\r\n *\r\n * Available variables:\r\n * - seedId: String, Id of the current seed\r\n * - logger: ALogger, logger implementation\r\n * - discoveredUrls: List<String>, List with all the urls discovered in the current page\r\n */\r\nimport com.accenture.aspire.framework.utilities.StringUtilities;\r\nimport org.openqa.selenium.WebDriverException;\r\n\r\ndef pageLinks = [];\r\n\r\ntry {\r\n pageLinks = driver.findElements(By.tagName(\"a\"));\r\n}\r\ncatch (WebDriverException wde) {\r\n /* Do Nothing */\r\n}\r\n\r\ntry {\r\n pageLinks = pageLinks + driver.findElements(By.xpath(\"//*[@src]\"));\r\n}\r\ncatch (WebDriverException wde) {\r\n /* Do Nothing */\r\n}\r\n\r\npageLinks.each { url ->\r\n String link = url.getAttribute(\"href\");\r\n \r\n if (StringUtilities.isEmpty(link))\r\n link = url.getAttribute(\"src\");\r\n \r\n discoveredUrls.add(link);\r\n}"
      },
      {
        "urlPattern": ".*tmp[^/]$",
        "cooldown": "1s",
        "useLinkExtractionScript": false
      }
    ],
    "cleanupRule": [
      {
        "pattern": ".*\\.xml$",
        "contentType": "text/html\\.*",
        "noIndexClass": "",
        "cleanupPattern": ""
      },
      {
        "pattern": ".*tmp[^/]$",
        "contentType": "text/html\\.*",
        "noIndexClass": "",
        "cleanupPattern": ""
      }
    ],
    "maxContentSize": "10mb",
    "showNon200AsErrors": false,
    "stopOnScanError": true,
    "logCrawledUrls": true,
    "debugContentOutput": false,
    "incrementalUrlCleanupRegex": "",
    "excludeMultimedia": true,
    "connectionTimeout": "10s",
    "connectionRequestTimeout": "10s",
    "socketTimeout": "10s",
    "useProxy": true,
    "proxyHost": "your-proxy.domain.com",
    "proxyPort": 8080,
    "proxyAuthentication": true,
    "proxyUser": "admin",
    "proxyPassword": "password",
    "staticAcl": [
      { "name": "john.doe", "domain": "", "entity": "user", "access": "allow" },
      { "name": "jana.doe", "domain": "", "entity": "user", "access": "allow" }
    ]
  }
}
```
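The "scopePattern" values in the example above are ordinary regular expressions matched against host names; the exact evaluation is internal to the crawler, but the behavior can be pictured with plain Groovy:

```groovy
// Illustration of regex scope matching (not connector code).
def scopePattern = [".*\\.example.com", ".*\\.another.com"]
def host = new URL("http://docs.example.com/page.html").getHost()

// ==~ requires the whole string to match the pattern.
assert scopePattern.any { host ==~ it }
```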
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | Id of the connection to update. | "d442adcab4b0" |
description | No | - | No | Name of the connection object. | "My Selenium Connection" |
throttlePolicy | No | - | No | Id of the throttle policy that applies to this connection object. | "b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The ids of the routing policies that this connection will use. | ["17f75ce7d0c7", "d42780003b36"] |
credential | No | - | No | Id of the credential | "6b235b333a1b" |
properties | No | - | No | Configuration object | |
(see create connection) |
```json
{
  "id": "89d6632a-a296-426c-adb0-d442adcab4b0",
  "description": "Selenium",
  "properties": {
    "webDriverImplementation": "CHROME",
    "webDriverPath": "/tmp/ach1/driver/chromedriver.exe",
    "headlessMode": true,
    "crawlScope": "CUSTOM",
    "scopePattern": [ ".*\\.example.com", ".*\\.another.com" ],
    "obeyRobots": true,
    "caseSensitiveUrls": true,
    "maxHops": 5,
    "includes": ".*\\.pdf",
    "excludes": ".*\\.xml",
    "pageSettings": [
      {
        "urlPattern": ".*\\.xml$",
        "cooldown": "1s",
        "useLinkExtractionScript": true,
        "linkExtractionScript": "/* Add the discovered URLs to the variable 'discoveredUrls'\r\n *\r\n * Available variables:\r\n * - seedId: String, Id of the current seed\r\n * - logger: ALogger, logger implementation\r\n * - discoveredUrls: List<String>, List with all the urls discovered in the current page\r\n */\r\nimport com.accenture.aspire.framework.utilities.StringUtilities;\r\nimport org.openqa.selenium.WebDriverException;\r\n\r\ndef pageLinks = [];\r\n\r\ntry {\r\n pageLinks = driver.findElements(By.tagName(\"a\"));\r\n}\r\ncatch (WebDriverException wde) {\r\n /* Do Nothing */\r\n}\r\n\r\ntry {\r\n pageLinks = pageLinks + driver.findElements(By.xpath(\"//*[@src]\"));\r\n}\r\ncatch (WebDriverException wde) {\r\n /* Do Nothing */\r\n}\r\n\r\npageLinks.each { url ->\r\n String link = url.getAttribute(\"href\");\r\n \r\n if (StringUtilities.isEmpty(link))\r\n link = url.getAttribute(\"src\");\r\n \r\n discoveredUrls.add(link);\r\n}"
      },
      {
        "urlPattern": ".*tmp[^/]$",
        "cooldown": "1s",
        "useLinkExtractionScript": false
      }
    ],
    "cleanupRule": [
      {
        "pattern": ".*\\.xml$",
        "contentType": "text/html\\.*",
        "noIndexClass": "",
        "cleanupPattern": ""
      },
      {
        "pattern": ".*tmp[^/]$",
        "contentType": "text/html\\.*",
        "noIndexClass": "",
        "cleanupPattern": ""
      }
    ],
    "maxContentSize": "10mb",
    "showNon200AsErrors": false,
    "stopOnScanError": true,
    "logCrawledUrls": true,
    "debugContentOutput": false,
    "incrementalUrlCleanupRegex": "",
    "excludeMultimedia": true,
    "connectionTimeout": "10s",
    "connectionRequestTimeout": "10s",
    "socketTimeout": "10s",
    "useProxy": true,
    "proxyHost": "your-proxy.domain.com",
    "proxyPort": 8080,
    "proxyAuthentication": true,
    "proxyUser": "admin",
    "proxyPassword": "password",
    "staticAcl": [
      { "name": "john.doe", "domain": "", "entity": "user", "access": "allow" },
      { "name": "jana.doe", "domain": "", "entity": "user", "access": "allow" }
    ]
  }
}
```
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
seed | Yes | - | No | The name of the database. It replaces the marker {DATABASE} used in the jdbcUrl field of the connection object. | "test_db" |
type | Yes | - | No | The value must be "rdb-snapshot". | "rdb-snapshot" |
description | Yes | - | No | Name of the seed object. | "My RDB Seed" |
connector | Yes | - | No | The id of the connector to be used with this seed. The connector type must match the seed type. | "e3ca414b0d31" |
connection | Yes | - | No | The id of the connection to be used with this seed. The connection type must match the seed type. | "e4a663fe9ee6" |
workflows | No | [ ] | Yes | The ids of the workflows that will be executed for the documents crawled. | ["5696c3f0bda4"] |
throttlePolicy | No | - | No | Id of the throttle policy that applies to this seed object. | "6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The ids of the routing policies that this seed will use. | ["17f75ce7d0c7", "d42780003b36"] |
tags | No | [ ] | Yes | The tags of the seed. These can be used to filter the seed | ["tag1", "tag2"] |
properties | Yes | - | No | Configuration object | |
fullSQL | Yes (this, or discoverySQL + extractionSQL) | - | No | The "SELECT" query run to retrieve all documents. This query is used for full or incremental scans. The "WHERE" clause can specify any condition required to crawl the desired documents. A change to any column selected in this SQL causes the document to be re-indexed. For example: "SELECT idCol, col1, col2, col3 FROM data_table". When slicing is enabled, add a "WHERE" clause containing "{SLICES}". For example: "SELECT idCol, col1, col2, col3 FROM data_table WHERE {SLICES}". | "SELECT * FROM table" |
discoverySQL | Yes (this or fullSQL) | - | No | The "SELECT" query run to discover documents. This query is used for full or incremental scans. A "WHERE" clause can specify any condition required to crawl the desired documents. A change to any column selected in this SQL causes the document to be re-indexed. For example: "SELECT idCol, lastModifiedDate FROM data_table". When slicing is enabled, add a "WHERE" clause containing "{SLICES}". For example: "SELECT idCol, col1 FROM data_table WHERE {SLICES}". | "SELECT id, lastModified FROM table" |
extractionSQL | Yes (this or fullSQL) | - | No | The "SELECT" query that extracts all data for each document found by the discovery SQL. At a minimum, you MUST include a "WHERE" clause containing the expression "idColumnName IN {IDS}", where idColumnName is the name of a unique key field. {IDS} is replaced automatically by the connector with the corresponding unique key values. For example: "SELECT col1, col2, col3 FROM data_table WHERE idCol IN {IDS}". Do not include the {SLICES} condition here. | "SELECT * FROM table WHERE id IN {IDS}" |
idColumn | Yes | - | No | The name of the column that holds the unique key used as the document id. This column must be present in both discoverySQL and extractionSQL. SQL aliases are NOT supported. | "id" |
stringIdColumn | No | false | No | Check if the unique key is a string value. | true |
quoteId | No | doNotQuote | No | Quote the id column; use this if the column name clashes with RDBMS keywords. Possible values: doNotQuote, `, " | doNotQuote |
ACL | |||||
aclColumn | Yes (this or aclSQL) | - | No | The name of the column that holds the ACLs. ACLs must be separated by semicolons and must follow this format: my-domain\userOrGroup@NT | "acl" |
aclSQL | Yes (this or aclColumn) | - | No | The query used to extract and build ACLs. This query depends on the database engine, so the syntax may vary. For example, on Oracle: SELECT 'my-domain\\' || user || '@NT;' FROM myTable | "SELECT * FROM table_acl" |
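The {IDS} marker in extractionSQL is substituted internally by the connector; the exact mechanism is not documented here, but the effect can be pictured with plain Groovy:

```groovy
// Illustration only: the actual substitution is performed by the connector.
def extractionSQL = "SELECT * FROM film WHERE film_id IN {IDS}"
def ids = [101, 102, 103]

def sql = extractionSQL.replace("{IDS}", "(" + ids.join(", ") + ")")
assert sql == "SELECT * FROM film WHERE film_id IN (101, 102, 103)"
```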
```json
{
  "seed": "test_db",
  "type": "rdb-snapshot",
  "description": "RDB_TEST",
  "properties": {
    "idColumn": "film_id",
    "stringIdColumn": false,
    "aclSQL": null,
    "aclColumn": "acl",
    "quoteId": "doNotQuote",
    "discoverySQL": "SELECT film_id, title FROM film",
    "extractionSQL": "SELECT * FROM film WHERE film_id IN {IDS}",
    "fullSQL": null
  }
}
```
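Values in the ACL column are semicolon-separated entries in the documented my-domain\userOrGroup@NT format; splitting such a value can be pictured as follows (the user and group names are hypothetical):

```groovy
// Illustration of the documented ACL column format (not connector code).
def aclValue = "my-domain\\alice@NT;my-domain\\writers@NT"
def acls = aclValue.split(";")

assert acls.length == 2
assert acls[0] == "my-domain\\alice@NT"
```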
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | Id of the seed to update | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
(see "Create seed" for the other fields) |
```json
{
  "id": "2f287669-d163-4e35-ad17-6bbfe9df3778",
  "seed": "test_db",
  "description": "RDB_TEST",
  "properties": {
    "idColumn": "film_id",
    "stringIdColumn": false,
    "aclSQL": null,
    "aclColumn": "acl",
    "quoteId": "doNotQuote",
    "discoverySQL": "SELECT film_id, title FROM film",
    "extractionSQL": "SELECT * FROM film WHERE film_id IN {IDS}",
    "fullSQL": null
  }
}
```