Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
type | Yes | - | No | The value must be "aspider". | "aspider" |
description | Yes | - | No | Name of the credential object. | "AspiderCredential" |
properties | Yes | - | No | Configuration object | |
useSelenium | No | false | No | Flag that tells Aspider whether it has to set up Selenium. | true / false |
webDriverImplementation | Yes | - | No | Browser used by Selenium, for example "CHROME". Note: Only used if useSelenium is set to true. | "CHROME" |
webDriverPath | Yes | - | No | Path to the Selenium web driver executable. Note: The driver must have execute permission. Note: Only used if useSelenium is set to true. | "lib\\chromedriver.exe" |
headlessMode | Yes | - | No | Flag to start the browser in headless mode (no GUI). Note: Only used if useSelenium is set to true. | true / false |
authMech | No | [] | Yes | Array containing the authentication mechanisms | [] |
host | No | "" | No | Hostname where the authentication mechanism should be used. Note: If empty, the authentication mechanism will be used against any host. | "example.com" |
port | Yes | -1 | No | Port where the authentication mechanism should be used. Note: -1 means the URL can have any port. | 8000 |
scheme | Yes | - | No | Scheme to use during the authentication, for example "Basic" or "NTLM". | "Basic" |
user | Yes | - | No | Name of the account to authenticate with. | "Administrator" |
password | Yes | - | No | Password to authenticate with. See Encryption API for more information. | "123456abC" |
domain | No | "" | No | Domain of the account used to authenticate with. | "EXAMPLE" |
realm | No | "" | No | Realm of the account to authenticate with. | "my-realm" |
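The host and port notes above imply a simple rule for deciding whether an authMech entry applies to a given URL. The Python sketch below is one way to express that rule. `auth_mech_applies` is a hypothetical helper, not part of Aspider. The exact (case-insensitive) host comparison and the fallback to the scheme's default port when the URL carries no explicit port are assumptions.

```python
from urllib.parse import urlsplit

# Ports assumed when the URL does not state one (assumption, not from the Aspider docs).
DEFAULT_PORTS = {"http": 80, "https": 443}


def auth_mech_applies(mech: dict, url: str) -> bool:
    """Return True if the authMech entry would be used for the given URL."""
    parts = urlsplit(url)
    host = mech.get("host", "")  # "" means any host
    port = mech.get("port", -1)  # -1 means any port

    # Assumption: hosts are compared exactly, ignoring case.
    if host and (parts.hostname or "").lower() != host.lower():
        return False
    if port != -1:
        url_port = parts.port or DEFAULT_PORTS.get(parts.scheme)
        if url_port != port:
            return False
    return True


mech = {"host": "example.com", "port": 8000}
print(auth_mech_applies(mech, "http://example.com:8000/login"))           # True
print(auth_mech_applies(mech, "http://example.com/login"))                # False
print(auth_mech_applies({"host": "", "port": -1}, "https://other.org/"))  # True
```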
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
adfs | Yes | false | No | Flag to indicate if ADFS should be used. Only required when scheme is "NTLM". | true / false |
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
useDefaultKrb5 | No | true | No | Flag to indicate if Aspider should use the system settings for Kerberos. | true / false |
kdc | Yes | - | No | Hostname of the key distribution center to get the Kerberos tickets. | "kdc.example.com" |
verbose | Yes | false | No | Flag to indicate if the entire negotiation process should be logged. | true / false |
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
loginUrl | Yes | - | No | URL of the login page | "https://example.com/login" |
formPath | Yes | - | No | CSS Selector for getting the login form. | "#content > form" |
userField | Yes | - | No | Id of the username field | "txtUser" |
passwordField | Yes | - | No | Id of the password field | "txtPass" |
adfs | No | false | No | Flag to enable the ADFS flow of requests during authentication. | true / false |
saml | No | false | No | Flag to enable the SAML flow of requests during authentication. | true / false |
retries | Yes | - | No | Number of retries to do if the authentication fails. | 5 |
customField | No | [] | Yes | Array of other fields in the form | [] |
name | Yes | - | No | Name of the field in the form | "myField" |
value | Yes | - | No | Value of the field in the form | "myValue" |
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
loginUrl | Yes | - | No | URL of the login page | "https://example.com/login" |
user | Yes | - | No | Username to authenticate with. | "user" |
password | Yes | - | No | Password to authenticate with. | "123456abC" |
loginScript | No | - | No | Script with the instructions to log in. A set of predefined variables is available to the script. | |
sessionScript | No | - | No | Script with the instructions to validate the session. Must return true if the session is valid. The same variables as for loginScript are available. | |
Code Block | ||||
---|---|---|---|---|
| ||||
{ "type": "aspider", "description": "AspiderCredential", "properties": { "useSelenium": true, "webDriverImplementation": "CHROME", "webDriverPath": "/dev/chromedriver.exe", "headlessMode": false, "authMech": [ { "host": "chessbase.com", "port": -1, "scheme": "Basic", "user": "userP", "password": "passwd", "domain": "", "realm": "" } ] } } |
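For scripted setups, a body like the one above can be assembled in code. The following is a minimal Python sketch that follows the credential field table. `build_credential` is a hypothetical helper, and the values are the placeholder examples from the table. How the body is sent to the server is not covered here.

```python
import json
from typing import Optional


def build_credential(description: str, auth_mechs: list,
                     selenium: Optional[dict] = None) -> dict:
    """Assemble a create-credential body following the field table."""
    properties = {"useSelenium": selenium is not None, "authMech": auth_mechs}
    if selenium is not None:
        # webDriverImplementation, webDriverPath and headlessMode
        # are only used when useSelenium is true.
        properties.update(selenium)
    return {"type": "aspider", "description": description, "properties": properties}


basic_auth = {
    "host": "example.com",
    "port": -1,  # -1: the URL can have any port
    "scheme": "Basic",
    "user": "Administrator",
    "password": "123456abC",
}
body = build_credential("AspiderCredential", [basic_auth])
print(json.dumps(body, indent=2))
```

Passing a `selenium` dictionary (for example `{"webDriverImplementation": "CHROME", "webDriverPath": "/dev/chromedriver.exe", "headlessMode": True}`) switches `useSelenium` to true and merges the driver settings into `properties`.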
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | Id of the credential to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
description | Yes | - | No | Name of the credential object. | "Aspider Credential" |
properties | Yes | - | No | Configuration object (see create credential). | |
Example
Code Block | ||||
---|---|---|---|---|
| ||||
{ "id": "2a5ca234-e328-4d40-bb2a-2df3e550b065", "description": "AspiderCredential", "properties": { "useSelenium": true, "webDriverImplementation": "CHROME", "webDriverPath": "/dev/chromedriver.exe", "headlessMode": false, "authMech": [ { "host": "chessbase.com", "port": -1, "scheme": "Basic", "user": "userP", "password": "passwd", "domain": "", "realm": "" } ] } } |
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
type | Yes | - | No | The value must be "aspider". | "aspider" |
description | Yes | - | No | Name of the connection object. | "My Aspider connection" |
credential | Yes | - | No | Id of the credential assigned to this object. | "2a5ca234-e328-4d40-bb2a-2df3e550b065" |
throttlePolicy | No | - | No | Id of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The ids of the routing policies that this connection will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
properties | Yes | - | No | Configuration object | |
Scope | |||||
crawlScope | Yes | HOST | No | Determines the scope of the crawl. Possible values: HOST, EVERYTHING, CUSTOM. EVERYTHING allows all URLs to be crawled. HOST allows only URLs on the seed hosts, for example 'abc.domain.com' when your seeds contain 'abc.domain.com'. CUSTOM lets you add one or more regular expression patterns to match the host name. These scopes can be extended by the include patterns. | "HOST" |
scopePattern | Yes, for crawlScope "CUSTOM" | - | No | Custom scope patterns. Any URL matching these patterns is included in the scope. Patterns are evaluated against the document URL. | ".*\\.example.com" |
userAgent | Yes | Aspider - The Aspire Web Crawler | No | User-Agent request header used to identify the web crawler. | "Aspider - The Aspire Web Crawler" |
obeyRobots | Yes | true | No | If true, the crawler obeys the robots.txt restrictions of each site. | true |
obeyMetaRobots | Yes | true | No | If true, the crawler obeys the HTML robots meta tags. | true |
caseSensitiveUrls | Yes | true | No | If false, all URLs are transformed to lower case before processing. | true |
maxHops | |||||
maxOutLinks | |||||
extractValueProps | |||||
followRedirects | |||||
extractJavaScript | |||||
includes | |||||
excludes | |||||
scanExcludedItems | |||||
Document processing | |||||
cleanupRule | |||||
Crawler | |||||
maxContentSize | |||||
showNon200AsErrors | |||||
stopOnScanError | |||||
logCrawledUrls | |||||
debugContentOutput | |||||
incrementalUrlCleanupRegex | |||||
excludeMultimedia | |||||
Connection | |||||
requestHeader | |||||
trustAllCertificates | |||||
connectionTimeout | |||||
connectionRequestTimeout | |||||
socketTimeout | |||||
useProxy | |||||
Security | |||||
staticAcl | |||||
Code Block | ||||
---|---|---|---|---|
| ||||
{ "type": "aspider", "description": "Aspider Test Connector", "credential": "2a5ca234-e328-4d40-bb2a-2df3e550b065", "properties": { "crawlScope": "CUSTOM", "scopePattern": ".*\\.example.com", "userAgent": "Aspider - The Aspire Web Crawler", "obeyRobots": true, "obeyMetaRobots": true, "caseSensitiveUrls": true, "maxHops": 5, "maxOutLinks": 6000, "extractValueProps": true, "followRedirects": true, "extractJavaScript": true, "includes": ".*\\.pdf", "excludes": ".*\\.xml", "scanExcludedItems": true, "absoluteExclude": ".*\\.xml$", "cleanupRule": [ { "urlPattern": ".*\\.xml", "contentTypes": "text/html\\.*", "noIndexClassnames": "nofollow", "cleanupPattern": "<!--noindex-->", "cleanupBeforeExtraction": true } ], "maxContentSize": "10mb", "showNon200AsErrors": true, "stopOnScanError": true, "logCrawledUrls": true, "debugContentOutput": true, "incrementalUrlCleanupRegex": "\\?.*", "excludeMultimedia": true, "requestHeader": [ { "header": "customHeader", "value": "val" } ], "trustAllCertificates": false, "connectTimeout": "10s", "connectionRequestTimeout": "10s", "socketTimeout": "10s", "useProxy": true, "proxyHost": "proxy.domain", "proxyPort": 8080, "proxyAuthentication": "NTLM", "proxyDomain": "sss", "proxyUser": "sss", "proxyPassword": "sss", "staticAcl": [ { "name": "acl", "domain": "domain", "entity": "group", "access": "allow" } ] } } |
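The three crawlScope values described in the connection table can be read as a small decision rule. The Python sketch below illustrates that reading and is not Aspider's actual implementation. `in_scope` is a hypothetical function. It assumes a custom pattern counts as a match when it is found anywhere in the document URL, and it leaves out the include patterns that can extend a scope.

```python
import re
from urllib.parse import urlsplit


def in_scope(url, seeds, crawl_scope="HOST", scope_patterns=()):
    """Decide whether a URL falls inside the crawl scope."""
    if crawl_scope == "EVERYTHING":
        return True
    if crawl_scope == "HOST":
        # Only hosts that appear in the seed URLs are allowed.
        seed_hosts = {urlsplit(seed).hostname for seed in seeds}
        return urlsplit(url).hostname in seed_hosts
    if crawl_scope == "CUSTOM":
        # Assumption: a pattern matches if it is found anywhere in the URL.
        return any(re.search(pattern, url) for pattern in scope_patterns)
    raise ValueError(f"unknown crawlScope: {crawl_scope}")


seeds = ["https://abc.domain.com/"]
print(in_scope("https://abc.domain.com/page", seeds))  # True
print(in_scope("https://xyz.domain.com/page", seeds))  # False
print(in_scope("https://www.example.com/a", seeds, "CUSTOM", [r".*\.example.com"]))  # True
```

Note that a JSON body writes the pattern with a doubled backslash (`".*\\.example.com"`), which is the same regular expression as the raw Python string used here.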
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | Id of the connection to update. | "89d6632a-a296-426c-adb0-d442adcab4b0" |
description | No | - | No | Name of the connection object. | "MyAspiderConnection" |
credential | No | - | No | Id of the credential assigned to this object. | "2a5ca234-e328-4d40-bb2a-2df3e550b065" |
throttlePolicy | No | - | No | Id of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The ids of the routing policies that this connection will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
properties | Yes | - | No | Configuration object | |
Code Block | ||||
---|---|---|---|---|
| ||||
{ "id": "89d6632a-a296-426c-adb0-d442adcab4b0", "description": "Aspider Test Connector", "credential": "2a5ca234-e328-4d40-bb2a-2df3e550b065", "properties": { } } |
To create the Connector object using the REST API, check this page.
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
seed | Yes | - | No | <seed description> | |
type | Yes | - | No | The value must be "aspider". | "aspider" |
description | Yes | - | No | Name of the seed object. | "MyAspiderSeed" |
connector | Yes | - | No | The id of the connector to be used with this seed. The connector type must match the seed type. | "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31" |
connection | Yes | - | No | The id of the connection to be used with this seed. The connection type must match the seed type. | "602d3700-28dd-4a6a-8b51-e4a663fe9ee6" |
workflows | No | [ ] | Yes | The ids of the workflows that will be executed for the documents crawled. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
throttlePolicy | No | - | No | Id of the throttle policy that applies to this seed object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The ids of the routing policies that this seed will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
tags | No | [ ] | Yes | The tags of the seed. These can be used to filter the seed | ["tag1", "tag2"] |
properties | Yes | - | No | Configuration object | |
Code Block | ||||
---|---|---|---|---|
| ||||
{ "type": "aspider", "seed": "directory", "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31", "description": "Aspider_Test_Seed", "throttlePolicy": "6b8b5f23-fc77-47a1-9b58-106577162e7b", "routingPolicies": ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"], "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6", "workflows": ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"], "tags": ["tag1", "tag2"], "properties": { } } |
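Before sending a create-seed body, it can help to check it against the fields marked as required in the seed table. The Python sketch below does only that. `REQUIRED_SEED_FIELDS` and `missing_seed_fields` are hypothetical names, and the check covers top-level keys only, not the contents of `properties`.

```python
# Fields marked "Yes" in the Required column of the create-seed table.
REQUIRED_SEED_FIELDS = ("seed", "type", "description", "connector", "connection", "properties")


def missing_seed_fields(body: dict) -> list:
    """Return the required create-seed fields that are absent from the body."""
    return [name for name in REQUIRED_SEED_FIELDS if name not in body]


draft = {"type": "aspider", "seed": "directory"}
print(missing_seed_fields(draft))
# ['description', 'connector', 'connection', 'properties']
```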
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | Id of the seed to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
seed | No | - | No | <seed description> | |
description | No | - | No | Name of the seed object. | "MyAspiderSeed" |
connector | No | - | No | The id of the connector to be used with this seed. The connector type must match the seed type. | "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31" |
connection | No | - | No | The id of the connection to be used with this seed. The connection type must match the seed type. | "602d3700-28dd-4a6a-8b51-e4a663fe9ee6" |
workflows | No | [ ] | Yes | The ids of the workflows that will be executed for the documents crawled. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
workflows.add | No | [ ] | Yes | The ids of the workflows to add. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
workflows.remove | No | [ ] | Yes | The ids of the workflows to remove. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
throttlePolicy | No | - | No | Id of the throttle policy that applies to this seed object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The ids of the routing policies that this seed will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
routingPolicies.add | No | [ ] | Yes | The ids of the routingPolicies to add. | ["b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
routingPolicies.remove | No | [ ] | Yes | The ids of the routingPolicies to remove. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7"] |
tags | No | [ ] | Yes | The tags of the seed. These can be used to filter the seed | ["tag1", "tag3"] |
tags.add | No | [ ] | Yes | The tags to add | ["tag4"] |
tags.remove | No | [ ] | Yes | The tags to remove | ["tag2"] |
properties | Yes | - | No | Configuration object | |
Code Block | ||||
---|---|---|---|---|
| ||||
{ "id": "2f287669-d163-4e35-ad17-6bbfe9df3778", "seed": "<seed example>", "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31", "description": "Aspider_Test_Seed", "throttlePolicy": "6b8b5f23-fc77-47a1-9b58-106577162e7b", "routingPolicies": ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"], "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6", "workflows": ["b255e950-1dac-46dc-8f86-1238b2fbdf27", "f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"], "tags": ["tag", "tag2"], "properties": { } } |
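The update table offers two ways to change a list field: send the full list (`workflows`, `routingPolicies`, `tags`) or send the `.add` and `.remove` variants. The Python sketch below shows the expected effect on a list. `apply_list_update` is a hypothetical helper. The order of operations (add first, then remove) and the skipping of duplicates are assumptions, since the table does not specify them.

```python
def apply_list_update(current, replace=None, add=(), remove=()):
    """Return the list after a full replacement or an incremental add/remove."""
    # A full list (e.g. "tags") replaces the stored values.
    result = list(replace) if replace is not None else list(current)
    # "<field>.add": append new values, skipping duplicates (assumption).
    for item in add:
        if item not in result:
            result.append(item)
    # "<field>.remove": drop the listed values.
    return [item for item in result if item not in remove]


tags = ["tag1", "tag2"]
print(apply_list_update(tags, add=["tag4"], remove=["tag2"]))  # ['tag1', 'tag4']
print(apply_list_update(tags, replace=["tag1", "tag3"]))       # ['tag1', 'tag3']
```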