


The Aspider Web Crawler connector crawls content from any given website.

Features


Some of the features of the Aspider Web Crawler connector include:

  • HTTP Authentication
    • Basic/Digest
    • NTLM
    • Negotiate/Kerberos
    • HTML Forms (Cookie based)
  • Connection throttling
  • Incremental crawl
  • Ignore/Respect robots.txt and robots meta-tags
  • Heritrix HTML parser for link extraction
  • Connection proxy
  • Configurable User-Agent
  • Max Crawl Depth
  • Distributed Crawling
  • Include/Exclude patterns
  • HTTPS crawling
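
To make the Include/Exclude patterns and Max Crawl Depth features more concrete, here is a minimal Python sketch of how such filters could be combined when deciding whether to follow a URL. It is an illustration only, not Aspider's implementation: the names `INCLUDE_PATTERNS`, `EXCLUDE_PATTERNS`, `MAX_CRAWL_DEPTH` and `should_crawl` are hypothetical, and the rule that a URL must match at least one include pattern and no exclude pattern is an assumption.

```python
import re
from urllib.parse import urlparse

# Hypothetical settings for illustration; these are NOT Aspider configuration keys.
INCLUDE_PATTERNS = [re.compile(r"^https?://example\.com/")]
EXCLUDE_PATTERNS = [re.compile(r"\.(zip|exe)$"), re.compile(r"/private/")]
MAX_CRAWL_DEPTH = 2


def should_crawl(url, depth):
    """Return True if the URL passes the depth limit and the pattern filters."""
    # Links found deeper than the configured limit are skipped.
    if depth > MAX_CRAWL_DEPTH:
        return False
    # Only HTTP and HTTPS URLs are considered crawlable in this sketch.
    if urlparse(url).scheme not in ("http", "https"):
        return False
    # Assumed rule: when include patterns exist, at least one must match.
    if INCLUDE_PATTERNS and not any(p.search(url) for p in INCLUDE_PATTERNS):
        return False
    # Assumed rule: a single matching exclude pattern rejects the URL.
    return not any(p.search(url) for p in EXCLUDE_PATTERNS)
```

With these example settings, `should_crawl("https://example.com/docs/index.html", 1)` returns `True`, while a URL under `/private/`, a URL on another host, or any URL found at depth 3 is rejected.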

Content Retrieved


The Aspider Web Crawler connector retrieves several types of documents. Some examples of the documents it retrieves are listed below.

Include

  • HTML pages
    • .html
    • .aspx
    • .php
    • etc
  • scripts and stylesheets
    • .js
    • .css
    • etc
  • images
    • .jpg
    • .gif
    • .png
    • etc


The crawler retrieves any document that is linked from the HTML markup.
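
As a rough illustration of link discovery from markup, the following sketch collects link targets from a page using Python's standard `html.parser` module. Aspider itself uses the Heritrix HTML parser, so this is a stand-in technique, not the connector's code. The `LINK_ATTRS` mapping, the `LinkCollector` class and the `extract_links` function are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that carries a link. This mapping is an assumption for the sketch.
LINK_ATTRS = {"a": "href", "link": "href", "script": "src", "img": "src"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the link-bearing attributes of known tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in the markup, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Given a page at `https://example.com/index.html` that references `/about.html`, `site.css`, `app.js` and `img/logo.png`, `extract_links` returns the four corresponding absolute URLs, which covers the HTML, stylesheet, script and image cases listed above.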

Limitations 


Due to its design, the Aspider Web Crawler has the following limitations:

  • Dynamically generated markup
    • Any markup that a browser generates by executing a site's JavaScript will NOT be detected by the crawler, so dynamically created links will not be discovered.
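
The limitation can be reproduced with any parser that reads the raw markup without running scripts. The sketch below uses Python's standard `html.parser` as a stand-in for the crawler's parser (an assumption made for illustration; `AnchorCollector` and `PAGE` are hypothetical names). The page contains one static link and one link that only exists after a browser executes the script.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


# Sample page: /dynamic.html is only added to the DOM when the script runs.
PAGE = """
<html><body>
  <a href="/static.html">Static link</a>
  <div id="menu"></div>
  <script>
    var a = document.createElement("a");
    a.href = "/dynamic.html";
    document.getElementById("menu").appendChild(a);
  </script>
</body></html>
"""

collector = AnchorCollector()
collector.feed(PAGE)
print(collector.hrefs)  # prints ['/static.html']
```

Only `/static.html` is found, because the script body is treated as plain text and never executed. A site that builds its navigation this way would need static links, for example in a sitemap page, for the crawler to reach those documents.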



