Introduction
The Aspider web crawler will crawl content from any given website.
It is based on the Heritrix HTML Parser and was implemented to meet customers' growing needs for features and customization.
Framework and Connector Features
Framework Features
| Name | Supported |
|---|---|
| Content Crawling | yes |
| Identity Crawling | no |
| Snapshot-based Incrementals | yes |
| Non-snapshot-based Incrementals | no |
| Document Hierarchy | no |
Web Crawler Features
The Aspider web crawler has the following features:
- Supports multiple authentication methods:
  - Basic
  - Digest
  - NTLM
  - Negotiate/Kerberos
- HTTP cookies.
- Configurable user agent.
- Maximum crawl depth.
- Flexible crawl scope.
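Aspider's internal API is not shown on this page; as an illustration only, the depth-limited, scope-restricted crawl described above can be sketched with Python's standard library. The site is simulated as an in-memory URL-to-HTML dict (a hypothetical stand-in for real HTTP fetching), and the scope is a simple URL prefix:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, start, max_depth, scope_prefix):
    """Breadth-first crawl over an in-memory site (url -> html dict),
    honouring a maximum crawl depth and a URL-prefix crawl scope."""
    visited = set()
    frontier = [(start, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        # Skip already-seen pages and anything outside the crawl scope.
        if url in visited or not url.startswith(scope_prefix):
            continue
        visited.add(url)
        # Do not expand links from pages at the maximum depth.
        if depth >= max_depth:
            continue
        parser = LinkExtractor()
        parser.feed(pages.get(url, ""))
        for link in parser.links:
            frontier.append((urljoin(url, link), depth + 1))
    return visited


site = {
    "http://example.com/": '<a href="/a">a</a> <a href="http://other.com/">x</a>',
    "http://example.com/a": '<a href="/b">b</a>',
    "http://example.com/b": '<a href="/c">c</a>',
}
crawled = crawl(site, "http://example.com/", max_depth=2,
                scope_prefix="http://example.com/")
```

Here `/c` is never reached because `/b` sits at the maximum depth, and `http://other.com/` is discarded for being outside the scope prefix.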
Content Crawled
The Aspider web crawler is able to crawl the following objects:
| Name | Type | Relevant Metadata | Content Fetch and Extraction | Description |
|---|---|---|---|---|
| Web Page | document | HTML meta tags, HTTP headers | Yes | Pages discovered on the target website |
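How Aspider extracts the metadata listed above is not documented here; as an illustrative sketch, the HTML meta tags of a page could be collected into a metadata dict with Python's standard-library parser (the sample HTML below is made up):

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs as page metadata."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            if "name" in attr_map and "content" in attr_map:
                self.metadata[attr_map["name"]] = attr_map["content"]


page = """<html><head>
<meta name="description" content="Example page">
<meta name="keywords" content="crawler,aspider">
</head><body>Hello</body></html>"""

extractor = MetaTagExtractor()
extractor.feed(page)
```

HTTP response headers, the other metadata source in the table, would come from the fetch itself rather than from parsing.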
Limitations
The Aspider web crawler has the following limitations:
- Websites with dynamic content are not supported; for such websites, please refer to the Selenium web crawler.
- Orphan pages are not supported: any page that falls out of scope because the links pointing to it no longer exist will be removed once the deletion conditions are met.
- Selenium-based authentication requires the corresponding web browser to be installed on the server where the crawl will be executed, as well as a web driver downloaded to match the browser version.