

Introduction


The Aspider web crawler will crawl content from any given website.

It is based on the Heritrix HTML parser and was implemented to meet growing customer needs for features and customization.


Framework and Connector Features


Framework Features

Name                             Supported
Content Crawling                 yes
Identity Crawling                no
Snapshot-based Incrementals      yes
Non-snapshot-based Incrementals  no
Document Hierarchy               no

Web Crawler Features

The Aspider web crawler has the following features:

  • Supports multiple authentication methods:
    • Basic
    • Digest
    • NTLM
    • Negotiate/Kerberos
    • HTTP Cookies
    • HTML Forms
    • Selenium (scriptable steps to log in with a real browser instance)
  • Configurable User-Agent header
  • Maximum crawl depth
  • Flexible crawl scope (host-only, domain-only, everything)
  • Robots policies (robots.txt and meta tags) for inclusion/exclusion of pages
  • Include/exclude regex patterns
  • Content cleanup (based on regex rules)

Allows removing parts of a page that should not be indexed or that contain links the crawler shouldn't follow.
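The cleanup step can be sketched as follows. This is an illustrative Python sketch, not Aspider's actual configuration syntax; the tag names and rules are hypothetical examples of regions one might strip before indexing and link extraction.

```python
import re

# Hypothetical cleanup rules: drop navigation markup and ad regions before
# indexing or link extraction. The patterns are illustrative only.
CLEANUP_RULES = [
    re.compile(r"<nav\b.*?</nav>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<!--\s*ad-slot\s*-->.*?<!--\s*/ad-slot\s*-->", re.DOTALL),
]

def clean_page(html: str) -> str:
    """Remove page regions matching any cleanup rule."""
    for rule in CLEANUP_RULES:
        html = rule.sub("", html)
    return html
```

With rules like these, links inside the removed regions are never extracted, so the crawler will not follow them.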

  • URL-cleanup for incrementals

This feature removes parts of the URL that may change over time without the document's contents changing, for instance access signatures or session IDs.

For instance, suppose a parent page contains the following link:

    • http://somesite.com/doc.pdf?accessBy=19350286

If the accessBy parameter changes every two hours even though doc.pdf itself does not, the parameter can be cleared from the URL used to identify the document. Subsequent incremental crawls then recognize it as the same document and verify the document's signature based on its contents.
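The URL-cleanup behavior described above can be sketched in Python. The parameter names in VOLATILE_PARAMS are illustrative; how Aspider actually configures which parameters to strip is not shown here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters whose values change between crawls without the content
# changing. "accessBy" matches the example above; the set is illustrative.
VOLATILE_PARAMS = {"accessBy", "sessionId"}

def canonical_url(url: str) -> str:
    """Strip volatile query parameters so incremental crawls can
    recognize the same document across runs."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in VOLATILE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Stable parameters are preserved, so two URLs that differ only in a volatile parameter map to the same canonical identity.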

  • Sitemap parsing

Seed URLs may point to HTML pages or sitemap XML files. If a sitemap is used as a seed, the crawler will parse the XML file and use the URLs listed in it to decide where to crawl.
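Extracting seed URLs from a sitemap can be sketched as below. This is a minimal illustration assuming a plain <urlset> sitemap; a real crawler would also handle <sitemapindex> files that point to further sitemaps.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def seed_urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> entries of a <urlset> sitemap as a list of URLs."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Each returned URL is then enqueued like any other discovered link, subject to the scope and include/exclude rules above.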


Content Crawled


The Aspider web crawler is able to crawl the following objects:

Name      Type      Relevant Metadata             Content Fetch and Extraction  Description
Web Page  document  HTML meta tags, HTTP headers  Yes                           Pages discovered on the target website

Limitations


The Aspider web crawler has the following limitations:

  • There is no support for websites with dynamically rendered content. For such websites, refer to the Selenium web crawler.
  • There is no support for orphan pages: a page that falls out of scope because the links pointing to it no longer exist will be removed once the deletion conditions are met.
  • Selenium-based authentication requires the corresponding web browser to be installed on the server where the crawl will be executed, as well as the matching web driver for that browser version.