

Introduction


The Aspider web crawler will crawl content from any given website.

It is based on the Heritrix HTML parser and was implemented to meet growing customer needs for features and customization.


Framework and Connector Features


Framework Features

Name                             Supported
Content Crawling                 yes
Identity Crawling                no
Snapshot-based Incrementals      yes
Non-snapshot-based Incrementals  no
Document Hierarchy               no

Web Crawler Features

The Aspider web crawler has the following features:

  • Supports multiple authentication methods:
    • Basic
    • Digest
    • NTLM
    • Negotiate/Kerberos
    • HTTP Cookies
    • HTML Forms
    • Selenium (scriptable steps to log in with a real browser instance)
  • Configurable User-Agent header
  • Maximum crawl depth
  • Flexible crawl scope (host-only, domain-only, everything)
  • Robots policies (robots.txt and meta tags) for inclusion/exclusion of pages
  • Include/exclude regex patterns
  • Content cleanup (based on regex rules)

Allows removing parts of a page that should not be indexed or that contain links the crawler shouldn't follow.
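The cleanup step can be sketched as follows. This is an illustrative Python sketch, not Aspider's actual configuration syntax; the tag names and rules are hypothetical examples of regions one might strip before indexing and link extraction.

```python
import re

# Hypothetical cleanup rules: drop navigation markup and ad regions before
# indexing or link extraction. The patterns are illustrative only.
CLEANUP_RULES = [
    re.compile(r"<nav\b.*?</nav>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<!--\s*ad-slot\s*-->.*?<!--\s*/ad-slot\s*-->", re.DOTALL),
]

def clean_page(html: str) -> str:
    """Remove page regions matching any cleanup rule."""
    for rule in CLEANUP_RULES:
        html = rule.sub("", html)
    return html
```

With rules like these, links inside the removed regions are never extracted, so the crawler will not follow them.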

  • URL-cleanup for incrementals

This feature removes parts of the URL that may change over time without the document's contents changing, for instance access signatures or session IDs.

For instance, suppose a parent page contains the following link:

    • http://somesite.com/doc.pdf?accessBy=19350286

If the accessBy parameter changes every two hours even though doc.pdf itself does not, the parameter can be cleared from the URL used to identify the document. Subsequent incremental crawls then recognize it as the same document and verify the document's signature based on its contents.
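The URL-cleanup behavior described above can be sketched in Python. The parameter names in VOLATILE_PARAMS are illustrative; how Aspider actually configures which parameters to strip is not shown here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters whose values change between crawls without the content
# changing. "accessBy" matches the example above; the set is illustrative.
VOLATILE_PARAMS = {"accessBy", "sessionId"}

def canonical_url(url: str) -> str:
    """Strip volatile query parameters so incremental crawls can
    recognize the same document across runs."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in VOLATILE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Stable parameters are preserved, so two URLs that differ only in a volatile parameter map to the same canonical identity.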

  • Sitemap parsing

Seed URLs may point to HTML pages or sitemap XML files. If a sitemap is used as a seed, the crawler will parse the XML file and use the URLs listed in it to decide where to crawl.
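Extracting seed URLs from a sitemap can be sketched as below. This is a minimal illustration assuming a plain <urlset> sitemap; a real crawler would also handle <sitemapindex> files that point to further sitemaps.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def seed_urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> entries of a <urlset> sitemap as a list of URLs."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Each returned URL is then enqueued like any other discovered link, subject to the scope and include/exclude rules above.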


Content Crawled


The Aspider web crawler is able to crawl the following objects:

Name      Type      Relevant Metadata             Content Fetch and Extraction  Description
Web Page  document  HTML meta tags, HTTP headers  Yes                           Pages discovered on the target website

Limitations


The Aspider web crawler has the following limitations:

  • There is no support for websites with dynamically rendered content. For such websites, refer to the Selenium web crawler.
  • There is no support for orphan pages: a page that falls out of scope because the links pointing to it no longer exist will be removed once the deletion conditions are met.
  • Selenium-based authentication requires the corresponding web browser to be installed on the server where the crawl will be executed, as well as the matching web driver for that browser version.