The Aspider web crawler will crawl content from any given website.
It is based on the Heritrix HTML parser and was implemented to meet customers' growing needs for features and customization.
| Name | Supported |
|---|---|
| Content Crawling | Yes |
| Identity Crawling | No |
| Snapshot-based Incrementals | Yes |
| Non-snapshot-based Incrementals | No |
| Document Hierarchy | No |
The Aspider web crawler has the following features:
Allows removing parts of a page that should not be indexed or that contain links to pages the crawler shouldn't follow.
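The idea behind this kind of exclusion can be sketched with Python's standard-library HTML parser. The exclusion rule here (drop anything inside an element whose `id` is `footer` or `nav`) is purely illustrative; Aspider's actual configuration mechanism is different.

```python
from html.parser import HTMLParser

class SectionStripper(HTMLParser):
    """Collects indexable text and followable links, skipping everything
    inside excluded elements (hypothetical rule: id in EXCLUDED_IDS).
    Assumes well-formed HTML; void elements inside an excluded region
    would need extra handling in a real implementation."""
    EXCLUDED_IDS = {"footer", "nav"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside an excluded element
        self.text = []   # text to index
        self.links = []  # links the crawler may follow

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth or attrs.get("id") in self.EXCLUDED_IDS:
            self.depth += 1
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.text.append(data.strip())

page = '<body><p>Keep me</p><div id="footer"><a href="/skip">skip</a></div></body>'
p = SectionStripper()
p.feed(page)
# p.text keeps only the paragraph; the footer link is not followed
```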
This feature removes parts of the URL that may change over time without the document's contents changing, for instance access signatures or session IDs.
For instance, a parent page may contain a link such as (illustrative URL): `http://www.example.com/docs/doc.pdf?accessBy=a1b2c3`

The accessBy parameter changes every 2 hours even if the doc.pdf file does not change. Clearing the accessBy parameter from the URL used to identify the document lets subsequent incremental crawls recognize it as the same document and verify the document's signature based on its contents.
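This normalization step can be sketched with the standard library's `urllib.parse`. The set of volatile parameter names is an assumption for the example; in practice it is site-specific configuration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters assumed to be volatile (illustrative; real lists are configured per site).
VOLATILE_PARAMS = {"accessBy", "sessionid"}

def canonical_url(url):
    """Return the URL with volatile query parameters removed, so repeated
    fetches of the same document map to the same identifier."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in VOLATILE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

canonical_url("http://example.com/doc.pdf?accessBy=abc123&lang=en")
# -> "http://example.com/doc.pdf?lang=en"
```

An incremental crawl would then compare documents by this canonical URL plus a content signature, rather than by the raw URL.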
Seed URLs may point to HTML pages or sitemap XML files. If a sitemap is used as a seed, the crawler parses the XML file and derives the URLs to visit from it.
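Extracting seed URLs from a sitemap can be sketched with the standard library's XML parser. This is a minimal illustration of the sitemap format, not Aspider's implementation; a real crawler would also follow nested `<sitemapindex>` files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return the <loc> entries of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page1</loc></url>
  <url><loc>http://example.com/page2</loc></url>
</urlset>"""
sitemap_urls(xml_text)
# -> ["http://example.com/page1", "http://example.com/page2"]
```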
The Aspider web crawler is able to crawl the following objects:
| Name | Type | Relevant Metadata | Content Fetch and Extraction | Description |
|---|---|---|---|---|
| Web Page | document | HTML meta tags, HTTP headers | Yes | Pages discovered on the target website |
The Aspider web crawler has the following limitations: