
The Aspider Web Crawler connector will crawl content from any given website. Aspider is based on the Heritrix parser for link discovery, but relies on the connector framework for connections and distributed crawls. Aspider is highly configurable and behaves better for intranet crawls in comparison to the Heritrix crawler.


Features


Some of the features of the Aspider Web Crawler connector include:

  • HTTP Authentication
    • Basic/Digest
    • NTLM
    • Negotiate/Kerberos
  • HTML forms (cookie-based)
  • Connection throttling
  • Incremental crawl
  • Ignore/Respect robots.txt and robots meta-tags
  • Heritrix HTML parser for link extraction
  • Connection proxy
  • Configurable User-Agent
  • Max crawl depth
  • Distributed crawling
  • Include/Exclude patterns
  • HTTPS crawling
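As an illustration of the "Ignore/Respect robots.txt" feature above, the following sketch shows how a crawler can honor a site's robots.txt rules. This is not Aspider's actual implementation; it uses Python's standard-library robot parser, and the `is_allowed` helper and the "Aspider" user-agent string are assumptions for the example.

```python
# Hypothetical helper showing robots.txt enforcement with the
# standard-library parser (not the connector's real code).
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

robots = """
User-agent: *
Disallow: /private/
"""

print(is_allowed(robots, "Aspider", "/public/page.html"))     # True
print(is_allowed(robots, "Aspider", "/private/secret.html"))  # False
```

A crawler configured to *ignore* robots.txt would simply skip this check before fetching each URL.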


Content Retrieved


Listed below are some examples of documents retrieved by this crawler:

  • HTML pages
    • html, aspx, php, etc.
  • Scripts and stylesheets
    • js, css, etc.
  • Images
    • jpg, gif, png, etc.
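The document types above can be recognized by file extension. The sketch below is a hypothetical helper (not part of the connector) that groups crawled URLs into the categories listed; the category names and extension sets are assumptions for illustration.

```python
# Hypothetical URL classifier for the document types listed above.
from urllib.parse import urlparse
import posixpath

CATEGORIES = {
    "page": {".html", ".htm", ".aspx", ".php"},
    "script_or_style": {".js", ".css"},
    "image": {".jpg", ".gif", ".png"},
}

def classify(url: str) -> str:
    """Map a URL to one of the document categories by its extension."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return "other"  # e.g. PDF or MS Word -- still retrieved if linked

print(classify("http://example.com/index.aspx"))  # page
print(classify("http://example.com/logo.png"))    # image
```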


Info

This crawler will retrieve any document found linked in the HTML markup, such as PDFs, MS Word, MS PowerPoint, etc.

Limitations 


Due to the design implementation, the Aspider Web Crawler has the following limitations:

  • Dynamically generated markup
    • Any markup generated by the browser by executing a site's JavaScript will NOT be detected by the crawler, so dynamic links will not be discovered.
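The limitation above can be demonstrated with a small sketch. A static HTML parser (here Python's standard-library `html.parser`, standing in for any parser-based crawler) never executes scripts, so a link that a browser would inject at runtime is invisible to it. The page content below is a made-up example.

```python
# Sketch: a static parser extracts only links present in the markup;
# links created by running JavaScript are never seen.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = """
<html><body>
  <a href="/static-page.html">Static link</a>
  <script>
    // A browser would add this link at runtime; a crawler will not.
    document.body.innerHTML += '<a href="/dynamic-page.html">Dynamic</a>';
  </script>
</body></html>
"""

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/static-page.html'] -- the dynamic link is missed
```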


Anything we should add? Please let us know.