


The Aspider Web Crawler connector crawls content from any given website.

Aspider uses the Heritrix HTML parser for link discovery, but relies on the Aspire 3 Connector Framework to handle connections and distributed crawls.

Aspider is highly configurable and handles intranet crawls better than the Heritrix crawler.

Features


Some of the features of the Aspider Web Crawler connector include:

  • HTTP Authentication
    • Basic/Digest
    • NTLM
    • Negotiate/Kerberos
    • HTML Forms (Cookie based)
  • Connection throttling
  • Incremental crawl
  • Ignore/Respect robots.txt and robots meta-tags
  • Heritrix HTML parser for link extraction
  • Connection proxy
  • Configurable User-Agent
  • Max Crawl Depth
  • Distributed Crawling
  • Include/Exclude patterns
  • HTTPS crawling
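
To make the interaction between Max Crawl Depth and the Include/Exclude patterns more concrete, here is a minimal Python sketch of a URL filter. It is not Aspider's implementation or configuration format: the function name `should_crawl`, its parameters, and the rule that an exclude match wins over an include match are all assumptions made for illustration.

```python
import re


def should_crawl(url, depth, include, exclude, max_depth):
    """Return True if a URL passes the depth limit and the pattern filters.

    Hypothetical helper: the precedence used here (exclude beats include)
    is an assumption, not documented Aspider behavior.
    """
    # Reject anything deeper than the configured maximum crawl depth.
    if depth > max_depth:
        return False
    # An exclude match always rejects the URL.
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    # With no include patterns, everything not excluded is accepted.
    if not include:
        return True
    return any(re.search(pattern, url) for pattern in include)
```

For example, with `include=[r"/docs/"]`, `exclude=[r"\.pdf$"]` and `max_depth=3`, a page under `/docs/` at depth 1 is accepted, while a PDF in the same folder or any page at depth 4 is rejected.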

Content Retrieved


The Aspider Web Crawler connector retrieves several types of documents. Listed below are some examples of documents retrieved by this crawler.

  • HTML pages
    • .html
    • .aspx
    • .php
    • etc
  • scripts and stylesheets
    • .js
    • .css
    • etc
  • images
    • .jpg
    • .gif
    • .png
    • etc


The crawler retrieves any document that is linked from the HTML markup.
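
As a rough illustration of how links to pages, scripts, stylesheets and images can be collected from static markup, the sketch below uses Python's standard `html.parser` as a stand-in for the Heritrix HTML parser. The `LINK_ATTRS` table, the `LinkCollector` class and the `extract_links` function are hypothetical names, and the set of tags covered is an illustrative subset rather than what Aspider actually extracts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that carries a link (assumed, illustrative subset).
LINK_ATTRS = {"a": "href", "link": "href", "script": "src", "img": "src"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from link-bearing attributes in static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in the markup, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Each URL returned by `extract_links` would be a candidate for retrieval, which is why scripts, stylesheets and images end up in the crawl alongside HTML pages.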

Limitations 


Because of its design, the Aspider Web Crawler has the following limitations:

  • Dynamically generated markup
    • Any markup that the browser generates by executing a site's JavaScript will NOT be detected by the crawler, so dynamic links will not be discovered.
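
The limitation can be seen in a small Python sketch. It uses the standard `html.parser` as a stand-in for the Heritrix parser (whose exact behavior may differ), and the page content and the `AnchorCollector` class are made up for illustration. The parser reads the markup as delivered by the server and never runs the script, so the link that the script would create in a browser is not found.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags present in the static markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


# Hypothetical page: one static link, one link created only at runtime.
PAGE = """
<a href="/static.html">Static</a>
<script>
  var a = document.createElement("a");
  a.href = "/dynamic.html";
  document.body.appendChild(a);
</script>
"""

collector = AnchorCollector()
collector.feed(PAGE)
print(collector.hrefs)  # only the static link is reported
```

Only `/static.html` is collected; `/dynamic.html` exists solely after a browser executes the script, so a crawler working from the raw markup never sees it.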


Anything we should add? Please let us know.

