Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
type | Yes | - | No | The value must be "aspider". | "aspider" |
description | Yes | - | No | Name of the credential object. | "AspiderCredential" |
properties | Yes | - | No | Configuration object | |
useSelenium | No | false | No | Flag to let Aspider know whether it has to set up Selenium. | true / false |
webDriverImplementation | Yes | - | No | Browser used by Selenium (the example below uses "CHROME"). Note: Only used if useSelenium is set to true. | "CHROME" |
webDriverPath | Yes | - | No | Path to the selenium web driver executable. Note: The driver must have execution permission. Note: Only used if useSelenium is set to true. | "lib\\chromedriver.exe" |
headlessMode | Yes | - | No | Flag to start the browser in headless mode (no GUI). Note: Only used if useSelenium is set to true. | true / false |
authMech | No | [] | Yes | Array containing the authentication mechanisms; each entry uses the fields below. | [] |
host | No | "" | No | Hostname where the authentication mechanism should be used. Note: If empty, the authentication mechanism will be used against any host. | "example.com" |
port | Yes | -1 | No | Port where the authentication mechanism should be used. Note: -1 means the URL can have any port. | 8000 |
scheme | Yes | - | No | Scheme to use during the authentication. Possible values include "Basic" and "NTLM"; scheme-specific fields are described in the tables below. | "Basic" |
user | Yes | - | No | Name of the account to authenticate with. | "Administrator" |
password | Yes | - | No | Password to authenticate with. See Encryption API for more information. | "123456abC" |
domain | No | "" | No | Domain of the account used to authenticate with. | "EXAMPLE" |
realm | No | "" | No | Realm of the account to authenticate with. | "my-realm" |
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
adfs | Yes | false | No | Flag to indicate if ADFS should be used. Only required when the scheme is "NTLM". | true / false |
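The table above lists the extra field available in an authMech entry when the scheme is "NTLM". As a hedged sketch (the host, user, and domain values are placeholders, not taken from the source), such an entry might look like:

```json
{
  "host": "intranet.example.com",
  "port": -1,
  "scheme": "NTLM",
  "user": "Administrator",
  "password": "123456abC",
  "domain": "EXAMPLE",
  "adfs": false
}
```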
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
useDefaultKrb5 | No | true | No | Flag to indicate if Aspider should use the system settings for Kerberos. | true / false |
kdc | Yes | - | No | Hostname of the key distribution center to get the Kerberos tickets. | "kdc.example.com" |
verbose | Yes | false | No | Flag to indicate if the entire negotiation process should be logged. | true / false |
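The table above covers Kerberos negotiation settings for an authMech entry. A minimal sketch follows; the exact scheme value for Kerberos is not stated in this section, so "Kerberos" is an assumption, while kdc and realm reuse the table examples:

```json
{
  "host": "example.com",
  "port": -1,
  "scheme": "Kerberos",
  "user": "Administrator",
  "password": "123456abC",
  "realm": "my-realm",
  "useDefaultKrb5": false,
  "kdc": "kdc.example.com",
  "verbose": false
}
```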
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
loginUrl | Yes | - | No | URL of the login page | "https://example.com/login" |
formPath | Yes | - | No | CSS Selector for getting the login form. | "#content > form" |
userField | Yes | - | No | ID of the username field | "txtUser" |
passwordField | Yes | - | No | ID of the password field | "txtPass" |
adfs | No | false | No | Flag to enable the ADFS flow of requests during authentication. | true / false |
saml | No | false | No | Flag to enable the SAML flow of requests during authentication. | true / false |
retries | Yes | - | No | Number of retries to do if the authentication fails. | 5 |
customField | No | [] | Yes | Array of other fields in the form | [] |
name | Yes | - | No | Name of the field in the form | "myField" |
value | Yes | - | No | Value of the field in the form | "myValue" |
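The table above describes form-based authentication. A sketch of a form-based authMech entry, assuming the scheme value "Form" (an assumption; this section does not name it) and reusing the field examples from the tables:

```json
{
  "host": "example.com",
  "port": -1,
  "scheme": "Form",
  "user": "Administrator",
  "password": "123456abC",
  "loginUrl": "https://example.com/login",
  "formPath": "#content > form",
  "userField": "txtUser",
  "passwordField": "txtPass",
  "retries": 5,
  "customField": [
    { "name": "myField", "value": "myValue" }
  ]
}
```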
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
loginUrl | Yes | - | No | URL of the login page | "https://example.com/login" |
user | Yes | - | No | Username to authenticate with. | "user" |
password | Yes | - | No | Password to authenticate with. | "123456abC" |
loginScript | No | - | No | Script with the instructions to log in. A set of predefined variables is available inside the script. | |
sessionScript | No | - | No | Script with the instructions to validate the session; it must return true if the session is valid. The same variables available to loginScript can be used here. | |
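The table above applies to scripted (Selenium-driven) logins. A sketch of such an entry, assuming the scheme value "Selenium" (an assumption) and useSelenium enabled on the credential; the script bodies are left empty here because the available script variables are product-specific:

```json
{
  "host": "example.com",
  "port": -1,
  "scheme": "Selenium",
  "loginUrl": "https://example.com/login",
  "user": "user",
  "password": "123456abC",
  "loginScript": "",
  "sessionScript": ""
}
```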
{ "type": "Aspider", "description": "AspiderCredential", "properties": { "useSelenium": true, "webDriverImplementation": "CHROME", "webDriverPath": "/dev/chromedriver.exe", "headlessMode": false, "authMech": [ { "host": "chessbase.com", "port": -1, "scheme": "Basic", "user": "userP", "password": "passwd", "domain": "", "realm": "" } ] } }
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | ID of the credential to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
description | Yes | - | No | Name of the credential object. | "Aspider Credential" |
properties | Yes | - | No | Configuration object | |
(see create credential) |
{ "id": "2a5ca234-e328-4d40-bb2a-2df3e550b065", "description": "AspiderCredential", "properties": { "useSelenium": true, "webDriverImplementation": "CHROME", "webDriverPath": "/dev/chromedriver.exe", "headlessMode": false, "authMech": [ { "host": "chessbase.com", "port": -1, "scheme": "Basic", "user": "userP", "password": "passwd", "domain": "", "realm": "" } ] } }
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
type | Yes | - | No | The value must be "aspider". | "aspider" |
description | Yes | - | No | Name of the connection object. | "My Aspider connection" |
credential | Yes | - | No | ID of the credential assigned to this object. | "2a5ca234-e328-4d40-bb2a-2df3e550b065" |
throttlePolicy | No | - | No | ID of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The IDs of the routing policies that this connection will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
properties | Yes | - | No | Configuration object | |
Scope | |||||
crawlScope | yes | HOST | No | Determines the scope of the crawl. The values are HOST, EVERYTHING, CUSTOM. These scopes can be extended by the include patterns. | "HOST" |
scopePattern | yes, for crawlScope "CUSTOM" | - | Yes | Custom scope patterns. Any URL matching one of these patterns will be included as part of the scope. The pattern is evaluated against the document URL. | ".*\\.example.com" |
userAgent | yes | Aspider - The Aspire Web Crawler | No | User Agent request header to identify the web crawler. | "Aspider - The Aspire Web Crawler" |
obeyRobots | no | true | No | If true, the crawler will obey the robots.txt restrictions of each site. | true |
obeyMetaRobots | no | true | No | If true, the crawler will obey the HTML robots meta tags. | true |
caseSensitiveUrls | no | true | No | If false, all URLs will be transformed into lower case before processing. | true |
maxHops | yes | 5 | No | Crawl depth: the number of hops from the seed the crawler is allowed to follow links. | 5 |
maxOutLinks | yes | 6000 | No | Maximum number of links to be reported by a single page. | 6000 |
extractValueProps | no | false | No | Extract value attributes. If true, the crawler will extract links from value attributes (e.g. input tags). | true |
followRedirects | no | false | No | If true, redirects will be followed and the content will be set as if it came from the original URL. Otherwise, each redirect will enqueue a new document. | true |
extractJavaScript | no | true | No | Crawl JavaScript URIs. If true, in-page JavaScript is scanned for strings that appear likely to be URIs. This typically finds both valid and invalid URIs. | true |
includes | no | - | Yes | The document will be processed by the connector if it matches one of the following patterns. Pattern evaluated against the document URL. | ".*\\.pdf" |
excludes | no | - | Yes | The document will not be processed by the connector if it matches one of the following patterns. Pattern evaluated against the document URL. | ".*\\.xml" |
scanExcludedItems | no | false | No | Scan excluded pages. If true, the crawler will scan the links of pages that have been excluded by a pattern (because they match an exclude pattern or do not match an include pattern). | true |
absoluteExclude | yes, for scanExcludedItems = true | - | Yes | Do-not-follow patterns. URL patterns that the crawler must not scan (follow). This only applies to items marked as excluded by the include/exclude rules. | ".*\\.xml$" |
Document processing | |||||
cleanupRule | no | - | Yes (see fields below) | Content cleanup rules. The specific behavior will apply to the URLs that match the following patterns. | |
urlPattern | yes | - | No | The URL will be matched against this pattern to check if it should be cleansed. | ".*\\.xml$" |
contentTypes | yes | - | No | Regular expression evaluated against the document MIME type to check if the document should be cleansed. | "text/html\\.*" |
noIndexClassnames | no | - | No | Comma-separated list of CSS classes to be removed from the page content. | "noindex, nofollow" |
cleanupPattern | no | - | No | Regular expression to remove matching text from the page content. | "<!-- noindex -->.*<!-- /noindex -->" |
cleanupBeforeExtraction | no | true | No | Clean up before link discovery. If true, the content cleanup will happen before links are discovered on the page. | true |
Crawler | |||||
maxContentSize | yes | 10mb | No | Maximum content size allowed to be fetched for a single page. | 15mb |
showNon200AsErrors | no | true | No | Show 400 and 500 status codes as errors. Set to false to only mark those URLs as "excluded" instead of "errored". | true |
stopOnScanError | no | true | No | Stop on scan error. If true, scan errors will stop the crawl; if false, the crawl continues past them. | true |
logCrawledUrls | no | true | No | Log crawled URLs. If true, a log with all the crawled URLs will be created. | true |
debugContentOutput | no | false | No | Write contents to file. If true, the crawler will write every page to the local file system. The files will be created under "data/CONTENT-SOURCE-NAME/output". | true |
incrementalUrlCleanupRegex | no | - | No | URL cleanup for incremental crawls. Regex for cleaning up the URL in case of dynamically generated parameters. This prevents incremental crawls from considering URLs as different documents when the only difference is a dynamic parameter. For example, http://myhost/my-page.html?mydynamic=123456 is transformed to http://myhost/my-page.html for incremental purposes, but the original URL is still used for fetching. | "\\?.*" |
excludeMultimedia | no | true | No | Reject images / videos / JavaScript / CSS. If true, js, css, swf, gif, png, jpg, jpeg, bmp, mp3, mp4, avi, mpg and mpeg files will be excluded from the crawl. | false |
Connection | |||||
requestHeader | no | - | yes (see fields below) | Custom HTTP headers. These headers will be included in each request made by the crawler. | |
header | yes | - | No | Name of the header. | "myCustomHeader" |
value | yes | - | No | Value of the header. | "myCustomValue" |
trustAllCertificates | no | false | No | Trust all HTTPS certificates. If true, all security certificates (HTTPS) are trusted by default. | true |
connectionTimeout | yes | 10s | No | Timeout used when a connection is established. | 20s |
connectionRequestTimeout | yes | 10s | No | Timeout used when requesting a connection from the connection manager. | 15s |
socketTimeout | yes | 10s | no | Timeout used when waiting for data (the maximum period of inactivity between two consecutive data packets). | 10s |
useProxy | no | false | no | Use proxy. If true, the crawler will connect through a proxy (see the proxy fields below). | true |
proxyHost | yes | - | no | Proxy hostname | "your-proxy.domain.com" |
proxyPort | yes | 8080 | no | Proxy port | 8080 |
proxyAuthentication | no | none | no | Proxy authentication mechanism used by the crawler (none / Basic / NTLM). | "Basic" |
Basic | |||||
proxyUser | yes | - | no | Proxy username | "user" |
proxyPassword | yes | - | no | Proxy password | "password" |
NTLM | |||||
proxyDomain | yes | - | no | Proxy username domain | |
proxyUser | yes | - | no | Proxy username | "user" |
proxyPassword | yes | - | no | Proxy password | "password" |
Security | |||||
staticAcl | no | - | yes (see fields below) | Static ACLs. These ACLs will be added to all of the documents. | |
name | yes | - | no | Name of the ACL. | "john.doe" |
domain | no | - | no | Domain to which the ACL belongs. | "domain" |
entity | no | user | no | Whether this ACL is for a user or a group (user/group). | "group" |
access | no | allow | no | Whether this ACL allows or denies access to crawled files (allow/deny). | "deny" |
{ "type": "Aspider", "description": "Aspider Test Connector", "credential": "2a5ca234-e328-4d40-bb2a-2df3e550b065", "properties": { "crawlScope": "CUSTOM", "scopePattern": ".*\\.example.com", "userAgent": "Aspider - The Aspire Web Crawler", "obeyRobots": true, "obeyMetaRobots": true, "caseSensitiveUrls": true, "maxHops": 5, "maxOutLinks": 6000, "extractValueProps": true, "followRedirects": true, "extractJavaScript": true, "includes": ".*\\.pdf", "excludes": ".*\\.xml", "scanExcludedItems": true, "absoluteExclude": ".*\\.xml$", "cleanupRule": [ { "urlPattern": ".*\\xml", "contentTypes": "text/html\\.*", "noIndexClassnames": "nofollow", "cleanupPattern": "<!--noindex-->", "cleanupBeforeExtraction": true } ], "maxContentSize": "10mb", "showNon200AsErrors": true, "stopOnScanError": true, "logCrawledUrls": true, "debugContentOutput": true, "incrementalUrlCleanupRegex": "\\?.*", "excludeMultimedia": true, "requestHeader": [ { "header": "customHeader", "value": "val" } ], "trustAllCertificates": false, "connectTimeout": "10s", "connectionRequestTimeout": "10s", "socketTimeout": "10s", "useProxy": true, "proxyHost": "proxy.domain", "proxyPort": 8080, "proxyAuthentication": "NTLM", "proxyDomain": "sss", "proxyUser": "sss", "proxyPassword": "sss", "staticAcl": [ { "name": "acl", "domain": "domain", "entity": "group", "access": "allow" } ] } }
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | ID of the connection to update. | "89d6632a-a296-426c-adb0-d442adcab4b0" |
description | No | - | No | Name of the connection object. | "MyAspiderConnection" |
credential | No | - | No | ID of the credential assigned to this object. | "2a5ca234-e328-4d40-bb2a-2df3e550b065" |
throttlePolicy | No | - | No | ID of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The IDs of the routing policies that this connection will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
properties | Yes | - | No | Configuration object | |
(see create connection) |
{ "id": "89d6632a-a296-426c-adb0-d442adcab4b0", "description": "Aspider Test Connector", "credential": "2a5ca234-e328-4d40-bb2a-2df3e550b065", "properties": { "crawlScope": "CUSTOM", "scopePattern": ".*\\.example.com", "userAgent": "Aspider - The Aspire Web Crawler", "obeyRobots": true, "obeyMetaRobots": true, "caseSensitiveUrls": true, "maxHops": 5, "maxOutLinks": 6000, "extractValueProps": true, "followRedirects": true, "extractJavaScript": true, "includes": ".*\\.pdf", "excludes": ".*\\.xml", "scanExcludedItems": true, "absoluteExclude": ".*\\.xml$", "cleanupRule": [ { "urlPattern": ".*\\xml", "contentTypes": "text/html\\.*", "noIndexClassnames": "nofollow", "cleanupPattern": "<!--noindex-->", "cleanupBeforeExtraction": true } ], "maxContentSize": "10mb", "showNon200AsErrors": true, "stopOnScanError": true, "logCrawledUrls": true, "debugContentOutput": true, "incrementalUrlCleanupRegex": "\\?.*", "excludeMultimedia": true, "requestHeader": [ { "header": "customHeader", "value": "val" } ], "trustAllCertificates": false, "connectTimeout": "10s", "connectionRequestTimeout": "10s", "socketTimeout": "10s", "useProxy": true, "proxyHost": "proxy.domain", "proxyPort": 8080, "proxyAuthentication": "NTLM", "proxyDomain": "sss", "proxyUser": "sss", "proxyPassword": "sss", "staticAcl": [ { "name": "acl", "domain": "domain", "entity": "group", "access": "allow" } ] } }
For the creation of the Connector object using the REST API, see the Connector creation page.
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
seed | Yes | - | No | URL where the crawl will start. | "https://example.com" |
type | Yes | - | No | The value must be "aspider". | "aspider" |
description | Yes | - | No | Name of the seed object. | "My Aspider Seed" |
connector | Yes | - | No | The ID of the connector to be used with this seed. The connector type must match the seed type. | "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31" |
connection | Yes | - | No | The ID of the connection to be used with this seed. The connection type must match the seed type. | "602d3700-28dd-4a6a-8b51-e4a663fe9ee6" |
workflows | No | [ ] | Yes | The IDs of the workflows that will be executed for the documents crawled. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
throttlePolicy | No | - | No | ID of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The IDs of the routing policies that this seed will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
tags | No | [ ] | Yes | The tags of the seed. These can be used to filter seeds. | ["tag1", "tag2"] |
properties | Yes | - | No | Configuration object | |
isSitemap | no | false | no | Sitemap URL. Set to true if the start URL is a sitemap. | false |
{ "type": "Aspider", "seed": "https://www.autoopravna-lahoda.cz/", "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31", "description": "Aspider_Test_Seed", "throttlePolicy": "6b8b5f23-fc77-47a1-9b58-106577162e7b", "routingPolicies": ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"], "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6", "workflows": ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"], "tags": ["tag1", "tag2"], "properties": { "isSitemap": false } }
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
id | Yes | - | No | ID of the seed to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
seed | Yes | - | No | URL where the crawl will start. | "https://example.com" |
description | No | - | No | Name of the seed object. | "My Aspider Seed" |
connector | No | - | No | The ID of the connector to be used with this seed. The connector type must match the seed type. | "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31" |
connection | No | - | No | The ID of the connection to be used with this seed. The connection type must match the seed type. | "602d3700-28dd-4a6a-8b51-e4a663fe9ee6" |
workflows | No | [ ] | Yes | The IDs of the workflows that will be executed for the documents crawled. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
throttlePolicy | No | - | No | ID of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
routingPolicies | No | [ ] | Yes | The IDs of the routing policies that this seed will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
tags | No | [ ] | Yes | The tags of the seed. These can be used to filter seeds. | ["tag1", "tag3"] |
properties | Yes | - | No | Configuration object | |
(see create seed) |
{ "id": "2f287669-d163-4e35-ad17-6bbfe9df3778", "seed": "https://example.com", "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31", "description": "Aspider_Test_Seed", "throttlePolicy": "6b8b5f23-fc77-47a1-9b58-106577162e7b", "routingPolicies": ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"], "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6", "workflows": ["b255e950-1dac-46dc-8f86-1238b2fbdf27", "f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"], "tags": ["tag", "tag2"], "properties": { "isSitemap": false } }