The Aspider Web Crawler Connector can be configured using the REST API. It requires the following entities to be created:

  • Credential
  • Connection
  • Connector
  • Seed

Below are examples of how to create the Credential, the Connection, and the Seed. For the Connector, please check this page.
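
All of these entities are created by POSTing JSON to the Aspire REST API endpoints shown in the examples below (aspire/_api/credentials, aspire/_api/connections, aspire/_api/seeds). The following is a minimal sketch of composing those calls from Python; the base URL (localhost:50505) and the helper names are assumptions for illustration, not part of the product:

```python
import json

# Assumed Aspire base URL; adjust to your installation.
BASE_URL = "http://localhost:50505"

def endpoint(entity: str) -> str:
    """Compose the REST endpoint for an entity type
    ("credentials", "connections", or "seeds")."""
    return f"{BASE_URL}/aspire/_api/{entity}"

def build_create_request(entity: str, payload: dict) -> tuple:
    """Return the (url, body) pair for a create (POST) call;
    send it with any HTTP client, e.g. requests.post(url, data=body)."""
    return endpoint(entity), json.dumps(payload)

url, body = build_create_request(
    "credentials",
    {"type": "aspider", "description": "AspiderCredential"},
)
```

The same pattern applies to updates, which are PUT calls against the entity URL suffixed with the entity ID, as the Update examples below show.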

Create Credential (Common)


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| type | Yes | - | No | The value must be "aspider". | "aspider" |
| description | Yes | - | No | Name of the credential object. | "AspiderCredential" |
| properties | Yes | - | No | Configuration object. | |
| useSelenium | No | false | No | Flag to let Aspider know whether it has to set up Selenium. | true / false |
| webDriverImplementation | Yes | - | No | Browser used by Selenium. Possible values: CHROME, FIREFOX. Only used if useSelenium is set to true. | "CHROME" / "FIREFOX" |
| webDriverPath | Yes | - | No | Path to the Selenium web driver executable. The driver must have execution permission. Only used if useSelenium is set to true. | "lib\\chromedriver.exe" |
| headlessMode | Yes | - | No | Flag to start the browser in headless mode (no GUI). Only used if useSelenium is set to true. | true / false |
| authMech | No | [] | Yes | Array containing the authentication mechanisms. | [] |
| host | No | "" | No | Hostname where the authentication mechanism should be used. If empty, the authentication mechanism will be used against any host. | "example.com" |
| port | Yes | -1 | No | Port where the authentication mechanism should be used. -1 means the URL can have any port. | 8000 |
| scheme | Yes | - | No | Scheme to use during authentication. Possible values: Basic, Digest, NTLM, Negotiate, Forms, Selenium. | "Basic" |
| user | Yes | - | No | Name of the account to authenticate with. | "Administrator" |
| password | Yes | - | No | Password to authenticate with. See the Encryption API for more information. | "123456abC" |
| domain | No | "" | No | Domain of the account used to authenticate with. | "EXAMPLE" |
| realm | No | "" | No | Realm of the account to authenticate with. | "my-realm" |

Create Credential - NTLM specific fields

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| adfs | Yes | false | No | Flag to indicate whether ADFS should be used; only required when scheme is "NTLM". | true / false |

Create Credential - Negotiate specific fields

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| useDefaultKrb5 | No | true | No | Flag to indicate whether Aspider should use the system settings for Kerberos. | true / false |
| kdc | Yes | - | No | Hostname of the key distribution center used to get the Kerberos tickets. | "kdc.example.com" |
| verbose | Yes | false | No | Flag to indicate whether the entire negotiation process should be logged. | true / false |

Create Credential - Forms specific fields

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| loginUrl | Yes | - | No | URL of the login page. | "https://example.com/login" |
| formPath | Yes | - | No | CSS selector for getting the login form. | "#content > form" |
| userField | Yes | - | No | ID of the username field. | "txtUser" |
| passwordField | Yes | - | No | ID of the password field. | "txtPass" |
| adfs | No | false | No | Flag to enable the ADFS flow of requests during authentication. | true / false |
| saml | No | false | No | Flag to enable the SAML flow of requests during authentication. | true / false |
| retries | Yes | - | No | Number of retries if the authentication fails. | 5 |
| customField | No | [] | Yes | Array of other fields in the form. | [] |
| name | Yes | - | No | Name of the field in the form. | "myField" |
| value | Yes | - | No | Value of the field in the form. | "myValue" |

Create Credential - Selenium specific fields

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| loginUrl | Yes | - | No | URL of the login page. | "https://example.com/login" |
| user | Yes | - | No | Username to authenticate with. | "user" |
| password | Yes | - | No | Password to authenticate with. | "123456abC" |
| loginScript | No | - | No | Script with the instructions to log in (see the available variables below). | |
| sessionScript | No | - | No | Script with the instructions to validate the session; must return true if the session is valid. The same variables are available as for loginScript. | |

Variables available to loginScript and sessionScript:

  • seedId: String, ID of the current seed
  • logger: ALogger, logger implementation
  • driver: WebDriver, Selenium web driver, used for interacting with the browser
  • loginUrl: String, URL of the login page
  • user: String, username for authentication
  • password: String, password for authentication

Example

POST aspire/_api/credentials
{
    "type": "aspider",
    "description": "AspiderCredential",
    "properties": {
        "useSelenium": true,
        "webDriverImplementation": "CHROME",
        "webDriverPath": "/dev/chromedriver.exe",
        "headlessMode": false,
        "authMech": [
            {
                "host": "chessbase.com",
                "port": -1,
                "scheme": "Basic",
                "user": "userP",
                "password": "passwd",
                "domain": "",
                "realm": ""
            }
        ]
    }
}

Update Credential


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| id | Yes | - | No | ID of the credential to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
| description | Yes | - | No | Name of the credential object. | "Aspider Credential" |
| properties | Yes | - | No | Configuration object (see Create Credential). | |

Example

PUT aspire/_api/credentials/2a5ca234-e328-4d40-bb2a-2df3e550b065
{
    "id": "2a5ca234-e328-4d40-bb2a-2df3e550b065",
    "description": "AspiderCredential",
    "properties": {
        "useSelenium": true,
        "webDriverImplementation": "CHROME",
        "webDriverPath": "/dev/chromedriver.exe",
        "headlessMode": false,
        "authMech": [
            {
                "host": "chessbase.com",
                "port": -1,
                "scheme": "Basic",
                "user": "userP",
                "password": "passwd",
                "domain": "",
                "realm": ""
            }
        ]
    }
}


Create Connection


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| type | Yes | - | No | The value must be "aspider". | "aspider" |
| description | Yes | - | No | Name of the connection object. | "My Aspider connection" |
| credential | Yes | - | No | ID of the credential assigned to this object. | "2a5ca234-e328-4d40-bb2a-2df3e550b065" |
| throttlePolicy | No | - | No | ID of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
| routingPolicies | No | [] | Yes | The IDs of the routing policies that this connection will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
| properties | Yes | - | No | Configuration object. | |

Scope

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| crawlScope | Yes | HOST | No | Determines the scope of the crawl. Possible values: HOST (allow only URLs like 'abc.domain.com' when your seeds contain 'abc.domain.com'), EVERYTHING (allow all URLs to be crawled), CUSTOM (add one or more regular expression patterns to match the host name). These scopes can be extended by the include patterns. | "HOST" |
| scopePattern | Yes, for crawlScope "CUSTOM" | - | Yes | Custom scope patterns. Any URL matching one of these patterns is included in the scope. Evaluated against the document URL. | ".*\\.example.com" |
| userAgent | Yes | Aspider - The Aspire Web Crawler | No | User-Agent request header used to identify the web crawler. | "Aspider - The Aspire Web Crawler" |
| obeyRobots | No | true | No | If true, the crawler obeys the robots.txt restrictions of each site. | true |
| obeyMetaRobots | No | true | No | If true, the crawler obeys the HTML robots meta tags. | true |
| caseSensitiveUrls | No | true | No | If false, all URLs are transformed to lower case before processing. | true |
| maxHops | Yes | 5 | No | Crawl depth: how many hops from the seed the crawler is allowed to follow links. | 5 |
| maxOutLinks | Yes | 6000 | No | Maximum number of links reported by a single page. | 6000 |
| extractValueProps | No | false | No | Extract value attributes. If true, the crawler extracts links from value attributes (e.g. input tags). | true |
| followRedirects | No | false | No | If true, redirects are followed and the content is reported as coming from the original URL. Otherwise each redirect enqueues a new document. | true |
| extractJavaScript | No | true | No | Crawl JavaScript URIs. If true, in-page JavaScript is scanned for strings that appear likely to be URIs. This typically finds both valid and invalid URIs. | true |
| includes | No | - | Yes | The document will be processed by the connector only if it matches one of these patterns. Evaluated against the document URL. | ".*\\.pdf" |
| excludes | No | - | Yes | The document will not be processed by the connector if it matches one of these patterns. Evaluated against the document URL. | ".*\\.xml" |
| scanExcludedItems | No | false | No | Scan excluded pages. If true, the crawler scans the links of pages that have been excluded by a pattern (because they match an exclude pattern or do not match an include pattern). | true |
| absoluteExclude | Yes, for scanExcludedItems = true | - | Yes | Do not follow patterns. URL patterns that the crawler must not scan (follow). This only applies to items marked as excluded by the include/exclude rules. | ".*\\.xml$" |
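
scopePattern, includes, and excludes are plain regular expressions evaluated against the document URL (scopePattern against the host when the scope is CUSTOM). Below is a quick sanity check of the table's ".*\\.example.com" pattern, using Python's re module as a stand-in for Aspider's internal matcher (an assumption; the exact anchoring semantics may differ):

```python
import re

# CUSTOM scope pattern from the table (JSON form: ".*\\.example.com")
scope = re.compile(r".*\.example.com")

in_scope = bool(scope.fullmatch("abc.example.com"))           # subdomain: matches
deep_in_scope = bool(scope.fullmatch("deep.sub.example.com")) # deeper subdomain: matches
out_of_scope = bool(scope.fullmatch("another-site.org"))      # different host: no match
```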

Document processing

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| cleanupRule | No | - | Yes (see fields below) | Content cleanup rules. The specific behavior applies to the URLs that match the following patterns. | |
| urlPattern | Yes | - | No | The URL is matched against this pattern to check whether the page should be cleaned up. | ".*\\.xml$" |
| contentTypes | Yes | - | No | Regular expression evaluated against the document MIME type to check whether the document should be cleaned up. | "text/html\\.*" |
| noIndexClassnames | No | - | No | Comma-separated list of CSS classes that will be removed from the page content. | "noindex, nofollow" |
| cleanupPattern | No | - | No | Regular expression used to remove matching text from the page content. | "<!-- noindex -->.*<!-- /noindex -->" |
| cleanupBeforeExtraction | No | true | No | Clean up before link discovery. If true, the content cleanup happens before links are discovered on the page. | true |
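
cleanupPattern is likewise a regular expression whose matches are stripped from the page content. The table's example pattern can be exercised like this (Python's re stands in for the crawler's matcher, and single-line content is assumed, since whether the crawler matches across newlines is not stated here):

```python
import re

# Example cleanupPattern from the table
cleanup_pattern = r"<!-- noindex -->.*<!-- /noindex -->"

html = ("<p>keep me</p>"
        "<!-- noindex --><nav>navigation boilerplate</nav><!-- /noindex -->"
        "<p>keep me too</p>")

# Remove everything between the noindex markers (markers included)
cleaned = re.sub(cleanup_pattern, "", html)
```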

Crawler

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| maxContentSize | Yes | 10mb | No | Maximum content size allowed to be fetched for a page. | 15mb |
| showNon200AsErrors | No | true | No | Show 400s and 500s status codes as errors. Set to false to only mark those URLs as "excluded" instead of "errored". | true |
| stopOnScanError | No | true | No | Stop on scan error. If true, scan errors stop the crawl from continuing. | true |
| logCrawledUrls | No | true | No | Log crawled URLs. If true, a log with all the crawled URLs is created. | true |
| debugContentOutput | No | false | No | Write contents to file. If true, the crawler writes every page to the local file system, in the folder "data/CONTENT-SOURCE-NAME/output". | true |
| incrementalUrlCleanupRegex | No | - | No | URL cleanup for incremental crawls. Regex for cleaning up the URL when it contains dynamically generated parameters, so that incremental crawls do not treat URLs as different documents when the only difference is those parameters. For example, http://myhost/my-page.html?mydynamic=123456 is transformed to http://myhost/my-page.html for incremental purposes, but the original URL is still used for fetching. | "\\?.*" |
| excludeMultimedia | No | true | No | Reject images / videos / JavaScript / CSS. If true, js, css, swf, gif, png, jpg, jpeg, bmp, mp3, mp4, avi, mpg and mpeg files are excluded from the crawl. | false |
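
The incrementalUrlCleanupRegex transformation described above can be reproduced directly: the matched portion of the URL is stripped before incremental comparison, while the original URL is still used for fetching. A sketch with Python's re:

```python
import re

# incrementalUrlCleanupRegex from the table (JSON form: "\\?.*")
cleanup_regex = r"\?.*"

original_url = "http://myhost/my-page.html?mydynamic=123456"

# Key used for incremental comparison; fetching still uses original_url
incremental_key = re.sub(cleanup_regex, "", original_url)
```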

Connection

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| requestHeader | No | - | Yes (see fields below) | Custom HTTP headers. These headers are included in each request made by the crawler. | |
| header | Yes | - | No | Name of the header. | "myCustomHeader" |
| value | Yes | - | No | Value of the header. | "myCustomValue" |
| trustAllCertificates | No | false | No | Trust all HTTPS certificates. If true, all security certificates (HTTPS) are trusted by default. | true |
| connectTimeout | Yes | 10s | No | Timeout used when a connection is established. | 20s |
| connectionRequestTimeout | Yes | 10s | No | Timeout used when requesting a connection from the connection manager. | 15s |
| socketTimeout | Yes | 10s | No | Timeout used while waiting for data (maximum period of inactivity between two consecutive data packets). | 10s |
| useProxy | No (see fields below) | false | No | Use proxy. If true, the crawler connects through a proxy. | true |
| proxyHost | Yes | - | No | Proxy hostname. | "your-proxy.domain.com" |
| proxyPort | Yes | 8080 | No | Proxy port. | 8080 |
| proxyAuthentication | No | none | No | Proxy authentication mechanism used by the crawler (none / Basic / NTLM). | "Basic" |

Basic

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| proxyUser | Yes | - | No | Proxy username. | "user" |
| proxyPassword | Yes | - | No | Proxy password. | "password" |

NTLM

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| proxyDomain | Yes | - | No | Domain of the proxy username. | |
| proxyUser | Yes | - | No | Proxy username. | "user" |
| proxyPassword | Yes | - | No | Proxy password. | "password" |

Security

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| staticAcl | No | - | Yes (see fields below) | Static ACLs. These ACLs are added to all of the documents. | |
| name | Yes | - | No | Name of the ACL. | "john.doe" |
| domain | No | - | No | Domain to which the ACL belongs. | "domain" |
| entity | No | user | No | Whether this ACL is for a user or a group (user / group). | "group" |
| access | No | allow | No | Whether this ACL allows or denies access to crawled files (allow / deny). | "deny" |

Example

POST aspire/_api/connections
{
    "type": "aspider",
    "description": "Aspider Test Connector",
    "credential": "2a5ca234-e328-4d40-bb2a-2df3e550b065",
    "properties": {
        "crawlScope": "CUSTOM",
        "scopePattern": ".*\\.example.com",
        "userAgent": "Aspider - The Aspire Web Crawler",
        "obeyRobots": true,
        "obeyMetaRobots": true,
        "caseSensitiveUrls": true,
        "maxHops": 5,
        "maxOutLinks": 6000,
        "extractValueProps": true,
        "followRedirects": true,
        "extractJavaScript": true,
        "includes": ".*\\.pdf",
        "excludes": ".*\\.xml",
        "scanExcludedItems": true,
        "absoluteExclude": ".*\\.xml$",
        "cleanupRule": [
            {
                "urlPattern": ".*\\.xml$",
                "contentTypes": "text/html\\.*",
                "noIndexClassnames": "nofollow",
                "cleanupPattern": "<!--noindex-->",
                "cleanupBeforeExtraction": true
            }
        ],
        "maxContentSize": "10mb",
        "showNon200AsErrors": true,
        "stopOnScanError": true,
        "logCrawledUrls": true,
        "debugContentOutput": true,
        "incrementalUrlCleanupRegex": "\\?.*",
        "excludeMultimedia": true,
        "requestHeader": [
            {
                "header": "customHeader",
                "value": "val"
            }
        ],
        "trustAllCertificates": false,
        "connectTimeout": "10s",
        "connectionRequestTimeout": "10s",
        "socketTimeout": "10s",
        "useProxy": true,
        "proxyHost": "proxy.domain",
        "proxyPort": 8080,
        "proxyAuthentication": "NTLM",
        "proxyDomain": "sss",
        "proxyUser": "sss",
        "proxyPassword": "sss",
        "staticAcl": [
            {
                "name": "acl",
                "domain": "domain",
                "entity": "group",
                "access": "allow"
            }
        ]
    }
}

Update Connection

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| id | Yes | - | No | ID of the connection to update. | "89d6632a-a296-426c-adb0-d442adcab4b0" |
| description | No | - | No | Name of the connection object. | "MyAspiderConnection" |
| credential | No | - | No | ID of the credential assigned to this object. | "2a5ca234-e328-4d40-bb2a-2df3e550b065" |
| throttlePolicy | No | - | No | ID of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
| routingPolicies | No | [] | Yes | The IDs of the routing policies that this connection will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
| properties | Yes | - | No | Configuration object (see Create Connection). | |

Example

PUT aspire/_api/connections/89d6632a-a296-426c-adb0-d442adcab4b0
{
    "id": "89d6632a-a296-426c-adb0-d442adcab4b0",
    "description": "Aspider Test Connector",
    "credential": "2a5ca234-e328-4d40-bb2a-2df3e550b065",
    "properties": {
        "crawlScope": "CUSTOM",
        "scopePattern": ".*\\.example.com",
        "userAgent": "Aspider - The Aspire Web Crawler",
        "obeyRobots": true,
        "obeyMetaRobots": true,
        "caseSensitiveUrls": true,
        "maxHops": 5,
        "maxOutLinks": 6000,
        "extractValueProps": true,
        "followRedirects": true,
        "extractJavaScript": true,
        "includes": ".*\\.pdf",
        "excludes": ".*\\.xml",
        "scanExcludedItems": true,
        "absoluteExclude": ".*\\.xml$",
        "cleanupRule": [
            {
                "urlPattern": ".*\\.xml$",
                "contentTypes": "text/html\\.*",
                "noIndexClassnames": "nofollow",
                "cleanupPattern": "<!--noindex-->",
                "cleanupBeforeExtraction": true
            }
        ],
        "maxContentSize": "10mb",
        "showNon200AsErrors": true,
        "stopOnScanError": true,
        "logCrawledUrls": true,
        "debugContentOutput": true,
        "incrementalUrlCleanupRegex": "\\?.*",
        "excludeMultimedia": true,
        "requestHeader": [
            {
                "header": "customHeader",
                "value": "val"
            }
        ],
        "trustAllCertificates": false,
        "connectTimeout": "10s",
        "connectionRequestTimeout": "10s",
        "socketTimeout": "10s",
        "useProxy": true,
        "proxyHost": "proxy.domain",
        "proxyPort": 8080,
        "proxyAuthentication": "NTLM",
        "proxyDomain": "sss",
        "proxyUser": "sss",
        "proxyPassword": "sss",
        "staticAcl": [
            {
                "name": "acl",
                "domain": "domain",
                "entity": "group",
                "access": "allow"
            }
        ]
    }
}

Create Connector Instance


For the creation of the Connector object using the REST API, check this page.

Update Connector Instance


For the update of the Connector object using the REST API, check this page.

Create Seed


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| seed | Yes | - | No | URL where the crawl will start. | "https://example.com" |
| type | Yes | - | No | The value must be "aspider". | "aspider" |
| description | Yes | - | No | Name of the seed object. | "My Aspider Seed" |
| connector | Yes | - | No | The ID of the connector to be used with this seed. The connector type must match the seed type. | "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31" |
| connection | Yes | - | No | The ID of the connection to be used with this seed. The connection type must match the seed type. | "602d3700-28dd-4a6a-8b51-e4a663fe9ee6" |
| workflows | No | [] | Yes | The IDs of the workflows that will be executed for the documents crawled. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
| throttlePolicy | No | - | No | ID of the throttle policy that applies to this seed. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
| routingPolicies | No | [] | Yes | The IDs of the routing policies that this seed will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
| tags | No | [] | Yes | The tags of the seed. These can be used to filter the seed. | ["tag1", "tag2"] |
| properties | Yes | - | No | Configuration object. | |
| isSitemap | No | false | No | Sitemap URL. Set to true if the start URL points to a sitemap. | false |

Example

POST aspire/_api/seeds
{
    "type": "aspider",
    "seed": "https://www.autoopravna-lahoda.cz/",
    "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31",
    "description": "Aspider_Test_Seed",
    "throttlePolicy": "6b8b5f23-fc77-47a1-9b58-106577162e7b",
    "routingPolicies": ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"],
    "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6",
    "workflows": ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"],
    "tags": ["tag1", "tag2"],
    "properties": {          
       "isSitemap": false
    }
}
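
The POST above can also be issued from a script. Below is a hedged sketch using a requests-style `post(url, json=...)` callable, injected so the snippet can run without a live Aspire server (the base URL and the transport are assumptions, not part of the API):

```python
def create_seed(http_post, base_url, seed_payload):
    """POST a seed definition to aspire/_api/seeds.

    `http_post` is any callable with the signature of requests.post;
    passing requests.post itself would perform the real call.
    """
    return http_post(f"{base_url}/aspire/_api/seeds", json=seed_payload)

# Payload mirroring the example above (the IDs are placeholders)
payload = {
    "type": "aspider",
    "seed": "https://example.com",
    "description": "Aspider_Test_Seed",
    "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31",
    "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6",
    "properties": {"isSitemap": False},
}

# Stub transport that records the call instead of sending it
calls = []
def fake_post(url, json=None):
    calls.append((url, json))
    return {"ok": True}

create_seed(fake_post, "http://localhost:50505", payload)
```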

Update Seed


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| id | Yes | - | No | ID of the seed to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
| seed | Yes | - | No | URL where the crawl will start. | "https://example.com" |
| description | No | - | No | Name of the seed object. | "My Aspider Seed" |
| connector | No | - | No | The ID of the connector to be used with this seed. The connector type must match the seed type. | "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31" |
| connection | No | - | No | The ID of the connection to be used with this seed. The connection type must match the seed type. | "602d3700-28dd-4a6a-8b51-e4a663fe9ee6" |
| workflows | No | [] | Yes | The IDs of the workflows that will be executed for the documents crawled. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
| throttlePolicy | No | - | No | ID of the throttle policy that applies to this seed. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
| routingPolicies | No | [] | Yes | The IDs of the routing policies that this seed will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
| tags | No | [] | Yes | The tags of the seed. These can be used to filter the seed. | ["tag1", "tag3"] |
| properties | Yes | - | No | Configuration object (see Create Seed). | |

Example

PUT aspire/_api/seeds/2f287669-d163-4e35-ad17-6bbfe9df3778
{
    "id": "2f287669-d163-4e35-ad17-6bbfe9df3778",
    "seed": "https://example.com",
    "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31",
    "description": "Aspider_Test_Seed",
    "throttlePolicy": "6b8b5f23-fc77-47a1-9b58-106577162e7b",
    "routingPolicies": ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"],
    "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6",
    "workflows": ["b255e950-1dac-46dc-8f86-1238b2fbdf27", "f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"],
    "tags": ["tag", "tag2"],
    "properties": {
        "isSitemap": false
    }
}