The Aspider Web Crawler Connector can be configured using the REST API. It requires the following entities to be created:

  • Credential
  • Connection
  • Connector
  • Seed

Below are examples of how to create the Credential, the Connection, and the Seed. For the Connector, please check this page.
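
All of these entities are created by POSTing JSON to the Aspire REST API endpoints shown in the examples below (aspire/_api/credentials, aspire/_api/connections, aspire/_api/seeds). The following is a minimal sketch of composing those calls from Python; the base URL (localhost:50505) and the helper names are assumptions for illustration, not part of the product:

```python
import json

# Assumed Aspire base URL; adjust to your installation.
BASE_URL = "http://localhost:50505"

def endpoint(entity: str) -> str:
    """Compose the REST endpoint for an entity type
    ("credentials", "connections", or "seeds")."""
    return f"{BASE_URL}/aspire/_api/{entity}"

def build_create_request(entity: str, payload: dict) -> tuple:
    """Return the (url, body) pair for a create (POST) call;
    send it with any HTTP client, e.g. requests.post(url, data=body)."""
    return endpoint(entity), json.dumps(payload)

url, body = build_create_request(
    "credentials",
    {"type": "aspider", "description": "AspiderCredential"},
)
```

The same pattern applies to updates, which are PUT calls against the entity URL suffixed with the entity ID, as the Update examples below show.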

Create Credential (Common)


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| type | Yes | - | No | The value must be "aspider". | "aspider" |
| description | Yes | - | No | Name of the credential object. | "AspiderCredential" |
| properties | Yes | - | No | Configuration object. | |
| useSelenium | No | false | No | Flag to let Aspider know whether it has to set up Selenium. | true / false |
| webDriverImplementation | Yes | - | No | Browser used by Selenium. Possible values: CHROME, FIREFOX. Only used if useSelenium is set to true. | "CHROME" / "FIREFOX" |
| webDriverPath | Yes | - | No | Path to the Selenium web driver executable. The driver must have execution permission. Only used if useSelenium is set to true. | "lib\\chromedriver.exe" |
| headlessMode | Yes | - | No | Flag to start the browser in headless mode (no GUI). Only used if useSelenium is set to true. | true / false |
| authMech | No | [] | Yes | Array containing the authentication mechanisms. | [] |
| host | No | "" | No | Hostname where the authentication mechanism should be used. If empty, the authentication mechanism will be used against any host. | "example.com" |
| port | Yes | -1 | No | Port where the authentication mechanism should be used. -1 means the URL can have any port. | 8000 |
| scheme | Yes | - | No | Scheme to use during authentication. Possible values: Basic, Digest, NTLM, Negotiate, Forms, Selenium. | "Basic" |
| user | Yes | - | No | Name of the account to authenticate with. | "Administrator" |
| password | Yes | - | No | Password to authenticate with. See the Encryption API for more information. | "123456abC" |
| domain | No | "" | No | Domain of the account used to authenticate with. | "EXAMPLE" |
| realm | No | "" | No | Realm of the account to authenticate with. | "my-realm" |

Create Credential - NTLM specific fields

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| adfs | Yes | false | No | Flag to indicate whether ADFS should be used; only required when scheme is "NTLM". | true / false |

Create Credential - Negotiate specific fields

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| useDefaultKrb5 | No | true | No | Flag to indicate whether Aspider should use the system settings for Kerberos. | true / false |
| kdc | Yes | - | No | Hostname of the key distribution center used to get the Kerberos tickets. | "kdc.example.com" |
| verbose | Yes | false | No | Flag to indicate whether the entire negotiation process should be logged. | true / false |

Create Credential - Forms specific fields

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| loginUrl | Yes | - | No | URL of the login page. | "https://example.com/login" |
| formPath | Yes | - | No | CSS selector for getting the login form. | "#content > form" |
| userField | Yes | - | No | ID of the username field. | "txtUser" |
| passwordField | Yes | - | No | ID of the password field. | "txtPass" |
| adfs | No | false | No | Flag to enable the ADFS flow of requests during authentication. | true / false |
| saml | No | false | No | Flag to enable the SAML flow of requests during authentication. | true / false |
| retries | Yes | - | No | Number of retries if the authentication fails. | 5 |
| customField | No | [] | Yes | Array of other fields in the form. | [] |
| name | Yes | - | No | Name of the field in the form. | "myField" |
| value | Yes | - | No | Value of the field in the form. | "myValue" |

Create Credential - Selenium specific fields

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| loginUrl | Yes | - | No | URL of the login page. | "https://example.com/login" |
| user | Yes | - | No | Username to authenticate with. | "user" |
| password | Yes | - | No | Password to authenticate with. | "123456abC" |
| loginScript | No | - | No | Script with the instructions to log in (see the available variables below). | |
| sessionScript | No | - | No | Script with the instructions to validate the session; must return true if the session is valid. The same variables are available as for loginScript. | |

Variables available to loginScript and sessionScript:

  • seedId: String, ID of the current seed
  • logger: ALogger, logger implementation
  • driver: WebDriver, Selenium web driver, used for interacting with the browser
  • loginUrl: String, URL of the login page
  • user: String, username for authentication
  • password: String, password for authentication

Example

POST aspire/_api/credentials
{
    "type": "aspider",
    "description": "AspiderCredential",
    "properties": {
        "useSelenium": true,
        "webDriverImplementation": "CHROME",
        "webDriverPath": "/dev/chromedriver.exe",
        "headlessMode": false,
        "authMech": [
            {
                "host": "chessbase.com",
                "port": -1,
                "scheme": "Basic",
                "user": "userP",
                "password": "passwd",
                "domain": "",
                "realm": ""
            }
        ]
    }
}

Update Credential


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| id | Yes | - | No | ID of the credential to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
| description | Yes | - | No | Name of the credential object. | "Aspider Credential" |
| properties | Yes | - | No | Configuration object (see Create Credential). | |

Example

PUT aspire/_api/credentials/2a5ca234-e328-4d40-bb2a-2df3e550b065
{
    "id": "2a5ca234-e328-4d40-bb2a-2df3e550b065",
    "description": "AspiderCredential",
    "properties": {
        "useSelenium": true,
        "webDriverImplementation": "CHROME",
        "webDriverPath": "/dev/chromedriver.exe",
        "headlessMode": false,
        "authMech": [
            {
                "host": "chessbase.com",
                "port": -1,
                "scheme": "Basic",
                "user": "userP",
                "password": "passwd",
                "domain": "",
                "realm": ""
            }
        ]
    }
}


Create Connection


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| type | Yes | - | No | The value must be "aspider". | "aspider" |
| description | Yes | - | No | Name of the connection object. | "My Aspider connection" |
| credential | Yes | - | No | ID of the credential assigned to this object. | "2a5ca234-e328-4d40-bb2a-2df3e550b065" |
| throttlePolicy | No | - | No | ID of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
| routingPolicies | No | [] | Yes | The IDs of the routing policies that this connection will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
| properties | Yes | - | No | Configuration object. | |

Scope

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| crawlScope | Yes | HOST | No | Determines the scope of the crawl. Possible values: HOST (allow only URLs like 'abc.domain.com' when your seeds contain 'abc.domain.com'), EVERYTHING (allow all URLs to be crawled), CUSTOM (add one or more regular expression patterns to match the host name). These scopes can be extended by the include patterns. | "HOST" |
| scopePattern | Yes, for crawlScope "CUSTOM" | - | Yes | Custom scope patterns. Any URL matching one of these patterns is included in the scope. Evaluated against the document URL. | ".*\\.example.com" |
| userAgent | Yes | Aspider - The Aspire Web Crawler | No | User-Agent request header used to identify the web crawler. | "Aspider - The Aspire Web Crawler" |
| obeyRobots | No | true | No | If true, the crawler obeys the robots.txt restrictions of each site. | true |
| obeyMetaRobots | No | true | No | If true, the crawler obeys the HTML robots meta tags. | true |
| caseSensitiveUrls | No | true | No | If false, all URLs are transformed to lower case before processing. | true |
| maxHops | Yes | 5 | No | Crawl depth: how many hops from the seed the crawler is allowed to follow links. | 5 |
| maxOutLinks | Yes | 6000 | No | Maximum number of links reported by a single page. | 6000 |
| extractValueProps | No | false | No | Extract value attributes. If true, the crawler extracts links from value attributes (e.g. input tags). | true |
| followRedirects | No | false | No | If true, redirects are followed and the content is reported as coming from the original URL. Otherwise each redirect enqueues a new document. | true |
| extractJavaScript | No | true | No | Crawl JavaScript URIs. If true, in-page JavaScript is scanned for strings that appear likely to be URIs. This typically finds both valid and invalid URIs. | true |
| includes | No | - | Yes | The document will be processed by the connector only if it matches one of these patterns. Evaluated against the document URL. | ".*\\.pdf" |
| excludes | No | - | Yes | The document will not be processed by the connector if it matches one of these patterns. Evaluated against the document URL. | ".*\\.xml" |
| scanExcludedItems | No | false | No | Scan excluded pages. If true, the crawler scans the links of pages that have been excluded by a pattern (because they match an exclude pattern or do not match an include pattern). | true |
| absoluteExclude | Yes, for scanExcludedItems = true | - | Yes | Do not follow patterns. URL patterns that the crawler must not scan (follow). This only applies to items marked as excluded by the include/exclude rules. | ".*\\.xml$" |
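
scopePattern, includes, and excludes are plain regular expressions evaluated against the document URL (scopePattern against the host when the scope is CUSTOM). Below is a quick sanity check of the table's ".*\\.example.com" pattern, using Python's re module as a stand-in for Aspider's internal matcher (an assumption; the exact anchoring semantics may differ):

```python
import re

# CUSTOM scope pattern from the table (JSON form: ".*\\.example.com")
scope = re.compile(r".*\.example.com")

in_scope = bool(scope.fullmatch("abc.example.com"))           # subdomain: matches
deep_in_scope = bool(scope.fullmatch("deep.sub.example.com")) # deeper subdomain: matches
out_of_scope = bool(scope.fullmatch("another-site.org"))      # different host: no match
```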

Document processing

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| cleanupRule | No | - | Yes (see fields below) | Content cleanup rules. The specific behavior applies to the URLs that match the following patterns. | |
| urlPattern | Yes | - | No | The URL is matched against this pattern to check whether the page should be cleaned up. | ".*\\.xml$" |
| contentTypes | Yes | - | No | Regular expression evaluated against the document MIME type to check whether the document should be cleaned up. | "text/html\\.*" |
| noIndexClassnames | No | - | No | Comma-separated list of CSS classes that will be removed from the page content. | "noindex, nofollow" |
| cleanupPattern | No | - | No | Regular expression used to remove matching text from the page content. | "<!-- noindex -->.*<!-- /noindex -->" |
| cleanupBeforeExtraction | No | true | No | Clean up before link discovery. If true, the content cleanup happens before links are discovered on the page. | true |
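
cleanupPattern is likewise a regular expression whose matches are stripped from the page content. The table's example pattern can be exercised like this (Python's re stands in for the crawler's matcher, and single-line content is assumed, since whether the crawler matches across newlines is not stated here):

```python
import re

# Example cleanupPattern from the table
cleanup_pattern = r"<!-- noindex -->.*<!-- /noindex -->"

html = ("<p>keep me</p>"
        "<!-- noindex --><nav>navigation boilerplate</nav><!-- /noindex -->"
        "<p>keep me too</p>")

# Remove everything between the noindex markers (markers included)
cleaned = re.sub(cleanup_pattern, "", html)
```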

Crawler

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| maxContentSize | Yes | 10mb | No | Maximum content size allowed to be fetched for a page. | 15mb |
| showNon200AsErrors | No | true | No | Show 400s and 500s status codes as errors. Set to false to only mark those URLs as "excluded" instead of "errored". | true |
| stopOnScanError | No | true | No | Stop on scan error. If true, scan errors stop the crawl from continuing. | true |
| logCrawledUrls | No | true | No | Log crawled URLs. If true, a log with all the crawled URLs is created. | true |
| debugContentOutput | No | false | No | Write contents to file. If true, the crawler writes every page to the local file system, in the folder "data/CONTENT-SOURCE-NAME/output". | true |
| incrementalUrlCleanupRegex | No | - | No | URL cleanup for incremental crawls. Regex for cleaning up the URL when it contains dynamically generated parameters, so that incremental crawls do not treat URLs as different documents when the only difference is those parameters. For example, http://myhost/my-page.html?mydynamic=123456 is transformed to http://myhost/my-page.html for incremental purposes, but the original URL is still used for fetching. | "\\?.*" |
| excludeMultimedia | No | true | No | Reject images / videos / JavaScript / CSS. If true, js, css, swf, gif, png, jpg, jpeg, bmp, mp3, mp4, avi, mpg and mpeg files are excluded from the crawl. | false |
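
The incrementalUrlCleanupRegex transformation described above can be reproduced directly: the matched portion of the URL is stripped before incremental comparison, while the original URL is still used for fetching. A sketch with Python's re:

```python
import re

# incrementalUrlCleanupRegex from the table (JSON form: "\\?.*")
cleanup_regex = r"\?.*"

original_url = "http://myhost/my-page.html?mydynamic=123456"

# Key used for incremental comparison; fetching still uses original_url
incremental_key = re.sub(cleanup_regex, "", original_url)
```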

Connection

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| requestHeader | No | - | Yes (see fields below) | Custom HTTP headers. These headers are included in each request made by the crawler. | |
| header | Yes | - | No | Name of the header. | "myCustomHeader" |
| value | Yes | - | No | Value of the header. | "myCustomValue" |
| trustAllCertificates | No | false | No | Trust all HTTPS certificates. If true, all security certificates (HTTPS) are trusted by default. | true |
| connectTimeout | Yes | 10s | No | Timeout used when a connection is established. | 20s |
| connectionRequestTimeout | Yes | 10s | No | Timeout used when requesting a connection from the connection manager. | 15s |
| socketTimeout | Yes | 10s | No | Timeout used while waiting for data (maximum period of inactivity between two consecutive data packets). | 10s |
| useProxy | No (see fields below) | false | No | Use proxy. If true, the crawler connects through a proxy. | true |
| proxyHost | Yes | - | No | Proxy hostname. | "your-proxy.domain.com" |
| proxyPort | Yes | 8080 | No | Proxy port. | 8080 |
| proxyAuthentication | No | none | No | Proxy authentication mechanism used by the crawler (none / Basic / NTLM). | "Basic" |

Basic

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| proxyUser | Yes | - | No | Proxy username. | "user" |
| proxyPassword | Yes | - | No | Proxy password. | "password" |

NTLM

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| proxyDomain | Yes | - | No | Domain of the proxy username. | |
| proxyUser | Yes | - | No | Proxy username. | "user" |
| proxyPassword | Yes | - | No | Proxy password. | "password" |

Security

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| staticAcl | No | - | Yes (see fields below) | Static ACLs. These ACLs are added to all of the documents. | |
| name | Yes | - | No | Name of the ACL. | "john.doe" |
| domain | No | - | No | Domain to which the ACL belongs. | "domain" |
| entity | No | user | No | Whether this ACL is for a user or a group (user / group). | "group" |
| access | No | allow | No | Whether this ACL allows or denies access to crawled files (allow / deny). | "deny" |

Example

POST aspire/_api/connections
{
    "type": "aspider",
    "description": "Aspider Test Connector",
    "credential": "2a5ca234-e328-4d40-bb2a-2df3e550b065",
    "properties": {
        "crawlScope": "CUSTOM",
        "scopePattern": ".*\\.example.com",
        "userAgent": "Aspider - The Aspire Web Crawler",
        "obeyRobots": true,
        "obeyMetaRobots": true,
        "caseSensitiveUrls": true,
        "maxHops": 5,
        "maxOutLinks": 6000,
        "extractValueProps": true,
        "followRedirects": true,
        "extractJavaScript": true,
        "includes": ".*\\.pdf",
        "excludes": ".*\\.xml",
        "scanExcludedItems": true,
        "absoluteExclude": ".*\\.xml$",
        "cleanupRule": [
            {
                "urlPattern": ".*\\.xml$",
                "contentTypes": "text/html\\.*",
                "noIndexClassnames": "nofollow",
                "cleanupPattern": "<!--noindex-->",
                "cleanupBeforeExtraction": true
            }
        ],
        "maxContentSize": "10mb",
        "showNon200AsErrors": true,
        "stopOnScanError": true,
        "logCrawledUrls": true,
        "debugContentOutput": true,
        "incrementalUrlCleanupRegex": "\\?.*",
        "excludeMultimedia": true,
        "requestHeader": [
            {
                "header": "customHeader",
                "value": "val"
            }
        ],
        "trustAllCertificates": false,
        "connectTimeout": "10s",
        "connectionRequestTimeout": "10s",
        "socketTimeout": "10s",
        "useProxy": true,
        "proxyHost": "proxy.domain",
        "proxyPort": 8080,
        "proxyAuthentication": "NTLM",
        "proxyDomain": "sss",
        "proxyUser": "sss",
        "proxyPassword": "sss",
        "staticAcl": [
            {
                "name": "acl",
                "domain": "domain",
                "entity": "group",
                "access": "allow"
            }
        ]
    }
}

Update Connection

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| id | Yes | - | No | ID of the connection to update. | "89d6632a-a296-426c-adb0-d442adcab4b0" |
| description | No | - | No | Name of the connection object. | "MyAspiderConnection" |
| credential | No | - | No | ID of the credential assigned to this object. | "2a5ca234-e328-4d40-bb2a-2df3e550b065" |
| throttlePolicy | No | - | No | ID of the throttle policy that applies to this connection object. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
| routingPolicies | No | [] | Yes | The IDs of the routing policies that this connection will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
| properties | Yes | - | No | Configuration object (see Create Connection). | |

Example

PUT aspire/_api/connections/89d6632a-a296-426c-adb0-d442adcab4b0
{
    "id": "89d6632a-a296-426c-adb0-d442adcab4b0",
    "description": "Aspider Test Connector",
    "credential": "2a5ca234-e328-4d40-bb2a-2df3e550b065",
    "properties": {
        "crawlScope": "CUSTOM",
        "scopePattern": ".*\\.example.com",
        "userAgent": "Aspider - The Aspire Web Crawler",
        "obeyRobots": true,
        "obeyMetaRobots": true,
        "caseSensitiveUrls": true,
        "maxHops": 5,
        "maxOutLinks": 6000,
        "extractValueProps": true,
        "followRedirects": true,
        "extractJavaScript": true,
        "includes": ".*\\.pdf",
        "excludes": ".*\\.xml",
        "scanExcludedItems": true,
        "absoluteExclude": ".*\\.xml$",
        "cleanupRule": [
            {
                "urlPattern": ".*\\.xml$",
                "contentTypes": "text/html\\.*",
                "noIndexClassnames": "nofollow",
                "cleanupPattern": "<!--noindex-->",
                "cleanupBeforeExtraction": true
            }
        ],
        "maxContentSize": "10mb",
        "showNon200AsErrors": true,
        "stopOnScanError": true,
        "logCrawledUrls": true,
        "debugContentOutput": true,
        "incrementalUrlCleanupRegex": "\\?.*",
        "excludeMultimedia": true,
        "requestHeader": [
            {
                "header": "customHeader",
                "value": "val"
            }
        ],
        "trustAllCertificates": false,
        "connectTimeout": "10s",
        "connectionRequestTimeout": "10s",
        "socketTimeout": "10s",
        "useProxy": true,
        "proxyHost": "proxy.domain",
        "proxyPort": 8080,
        "proxyAuthentication": "NTLM",
        "proxyDomain": "sss",
        "proxyUser": "sss",
        "proxyPassword": "sss",
        "staticAcl": [
            {
                "name": "acl",
                "domain": "domain",
                "entity": "group",
                "access": "allow"
            }
        ]
    }
}

Create Connector Instance


For the creation of the Connector object using the REST API, check this page.

Update Connector Instance


For the update of the Connector object using the REST API, check this page.

Create Seed


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| seed | Yes | - | No | URL where the crawl will start. | "https://example.com" |
| type | Yes | - | No | The value must be "aspider". | "aspider" |
| description | Yes | - | No | Name of the seed object. | "My Aspider Seed" |
| connector | Yes | - | No | The ID of the connector to be used with this seed. The connector type must match the seed type. | "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31" |
| connection | Yes | - | No | The ID of the connection to be used with this seed. The connection type must match the seed type. | "602d3700-28dd-4a6a-8b51-e4a663fe9ee6" |
| workflows | No | [] | Yes | The IDs of the workflows that will be executed for the documents crawled. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
| throttlePolicy | No | - | No | ID of the throttle policy that applies to this seed. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
| routingPolicies | No | [] | Yes | The IDs of the routing policies that this seed will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
| tags | No | [] | Yes | The tags of the seed. These can be used to filter the seed. | ["tag1", "tag2"] |
| properties | Yes | - | No | Configuration object. | |
| isSitemap | No | false | No | Sitemap URL. Set to true if the start URL points to a sitemap. | false |

Example

POST aspire/_api/seeds
{
    "type": "aspider",
    "seed": "https://www.autoopravna-lahoda.cz/",
    "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31",
    "description": "Aspider_Test_Seed",
    "throttlePolicy": "6b8b5f23-fc77-47a1-9b58-106577162e7b",
    "routingPolicies": ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"],
    "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6",
    "workflows": ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"],
    "tags": ["tag1", "tag2"],
    "properties": {          
       "isSitemap": false
    }
}
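
The POST above can also be issued from a script. Below is a hedged sketch using a requests-style `post(url, json=...)` callable, injected so the snippet can run without a live Aspire server (the base URL and the transport are assumptions, not part of the API):

```python
def create_seed(http_post, base_url, seed_payload):
    """POST a seed definition to aspire/_api/seeds.

    `http_post` is any callable with the signature of requests.post;
    passing requests.post itself would perform the real call.
    """
    return http_post(f"{base_url}/aspire/_api/seeds", json=seed_payload)

# Payload mirroring the example above (the IDs are placeholders)
payload = {
    "type": "aspider",
    "seed": "https://example.com",
    "description": "Aspider_Test_Seed",
    "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31",
    "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6",
    "properties": {"isSitemap": False},
}

# Stub transport that records the call instead of sending it
calls = []
def fake_post(url, json=None):
    calls.append((url, json))
    return {"ok": True}

create_seed(fake_post, "http://localhost:50505", payload)
```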

Update Seed


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| id | Yes | - | No | ID of the seed to update. | "2f287669-d163-4e35-ad17-6bbfe9df3778" |
| seed | Yes | - | No | URL where the crawl will start. | "https://example.com" |
| description | No | - | No | Name of the seed object. | "My Aspider Seed" |
| connector | No | - | No | The ID of the connector to be used with this seed. The connector type must match the seed type. | "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31" |
| connection | No | - | No | The ID of the connection to be used with this seed. The connection type must match the seed type. | "602d3700-28dd-4a6a-8b51-e4a663fe9ee6" |
| workflows | No | [] | Yes | The IDs of the workflows that will be executed for the documents crawled. | ["f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"] |
| throttlePolicy | No | - | No | ID of the throttle policy that applies to this seed. | "f5587cee-9116-4011-b3a9-6b235b333a1b" |
| routingPolicies | No | [] | Yes | The IDs of the routing policies that this seed will use. | ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"] |
| tags | No | [] | Yes | The tags of the seed. These can be used to filter the seed. | ["tag1", "tag3"] |
| properties | Yes | - | No | Configuration object (see Create Seed). | |

Example

PUT aspire/_api/seeds/2f287669-d163-4e35-ad17-6bbfe9df3778
{
    "id": "2f287669-d163-4e35-ad17-6bbfe9df3778",
    "seed": "https://example.com",
    "connector": "82f7f0a4-8d28-47ce-8c9d-e3ca414b0d31",
    "description": "Aspider_Test_Seed",
    "throttlePolicy": "6b8b5f23-fc77-47a1-9b58-106577162e7b",
    "routingPolicies": ["313de87c-3cb9-4fe0-a2cb-17f75ce7d0c7", "b4d2579f-1a0a-4a8b-9fd4-d42780003b36"],
    "connection": "602d3700-28dd-4a6a-8b51-e4a663fe9ee6",
    "workflows": ["b255e950-1dac-46dc-8f86-1238b2fbdf27", "f8c414cb-1f5d-42ef-9cc9-5696c3f0bda4"],
    "tags": ["tag", "tag2"],
    "properties": {
        "isSitemap": false
    }
}