The Aspider web crawler will crawl content from a website.


Introduction


The Aspider web crawler will crawl content from any given website.

It is based on the Heritrix HTML parser and was implemented to meet customers' growing needs for additional features and customization.


Framework and Connector Features


Framework Features

Name | Supported
Content Crawling | yes
Identity Crawling | no
Snapshot-based Incrementals | yes
Non-snapshot-based Incrementals | no
Document Hierarchy | no
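
As a rough illustration of what snapshot-based incrementals generally involve, the sketch below compares a snapshot of the previous crawl with the current one and classifies each URL as added, updated, or deleted. It is not the Aspider implementation: the function names, the snapshot layout (a URL-to-signature mapping), and the example URLs are all assumptions made for this example.

```python
import hashlib


def signature(content: str) -> str:
    """Return a stable fingerprint for a page body."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {url: signature} snapshots and classify each URL."""
    added = sorted(url for url in current if url not in previous)
    deleted = sorted(url for url in previous if url not in current)
    updated = sorted(
        url for url in current
        if url in previous and current[url] != previous[url]
    )
    return {"add": added, "update": updated, "delete": deleted}


# Hypothetical data: two crawls of the same small site.
previous = {
    "https://example.com/": signature("home v1"),
    "https://example.com/about": signature("about"),
    "https://example.com/old": signature("old page"),
}
current = {
    "https://example.com/": signature("home v2"),
    "https://example.com/about": signature("about"),
    "https://example.com/new": signature("new page"),
}

changes = diff_snapshots(previous, current)
```

In this sketch, a page that is no longer discovered in the current crawl ends up in the "delete" list, while unchanged pages appear in none of the three lists.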

Web Crawler Features

The Aspider web crawler has the following features:

  • Supports multiple authentication methods:
    • Basic
    • Digest
    • NTLM
    • Negotiate/Kerberos
    • HTTP Cookies
      • HTML Forms
      • Selenium
  • Configurable user agent.
  • Max crawl depth.
  • Flexible crawl scope.
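
To make the maximum crawl depth and crawl scope settings more concrete, here is a minimal sketch of a breadth-first crawl over an in-memory link graph. It does not reflect how Aspider is implemented or configured: the `LINKS` graph, the `in_scope` and `crawl` functions, and the single-host scope rule are hypothetical and chosen only for illustration.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory link graph standing in for fetched pages.
LINKS = {
    "https://example.com/": ["https://example.com/docs", "https://other.org/"],
    "https://example.com/docs": ["https://example.com/docs/a"],
    "https://example.com/docs/a": ["https://example.com/docs/a/deep"],
    "https://example.com/docs/a/deep": [],
    "https://other.org/": [],
}


def in_scope(url: str, allowed_host: str) -> bool:
    """Keep only URLs on the allowed host (one simple scope rule)."""
    return urlparse(url).hostname == allowed_host


def crawl(seed: str, max_depth: int, allowed_host: str) -> list[str]:
    """Breadth-first traversal that does not follow links past max_depth."""
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are kept, but their links are not followed
        for link in LINKS.get(url, []):
            if link not in seen and in_scope(link, allowed_host):
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("https://example.com/", max_depth=2, allowed_host="example.com")
```

With a depth limit of 2, the page three links away from the seed is never reached, and the link to the other host is dropped by the scope check.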


Content Crawled


The Aspider web crawler is able to crawl the following objects:

Name | Type | Relevant Metadata | Content Fetch and Extraction | Description
Web Page | document | HTML Meta tags, HTTP headers | Yes | Pages discovered on the target website
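
The sketch below shows one way HTML meta tags and HTTP headers could be gathered into a single metadata record for a web page, using only the Python standard library. The `MetaTagParser` class, the `page_metadata` function, and the `meta.` / `header.` key prefixes are assumptions for this example and do not describe the field names Aspider actually produces.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip meta tags without a name/content pair, such as <meta charset>.
        if name and content is not None:
            self.meta[name.lower()] = content


def page_metadata(html: str, headers: dict[str, str]) -> dict[str, str]:
    """Merge meta tags and HTTP headers into one flat metadata record."""
    parser = MetaTagParser()
    parser.feed(html)
    metadata = {f"meta.{k}": v for k, v in parser.meta.items()}
    metadata.update({f"header.{k.lower()}": v for k, v in headers.items()})
    return metadata


# Hypothetical page and response headers.
sample_html = """<html><head>
<meta name="description" content="Product overview">
<meta name="Author" content="Docs Team">
<meta charset="utf-8">
</head><body>Hello</body></html>"""
sample_headers = {"Content-Type": "text/html; charset=utf-8"}

metadata = page_metadata(sample_html, sample_headers)
```

Names are lowercased here so that `Author` and `author` map to the same key; whether a real crawler normalizes names this way is a design choice, not something the table above specifies.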

Limitations


The Aspider web crawler has the following limitations:

  • There is no support for websites with dynamic content. For such websites, please refer to the Selenium web crawler.
  • There is no support for orphan pages. Any page that is cut from the scope because the links pointing to it no longer exist will be removed once the conditions are met.
  • Selenium-based authentication requires the corresponding web browser to be installed on the server where the crawl will be executed, as well as the corresponding web driver, downloaded to match the browser version.