...

  • This is normally a DNS issue. Check with your network team to see if the site URL(s) you want to crawl are reachable from the server running Aspire.
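
For a quick reachability check from the machine running Aspire, a short Java snippet like the following resolves a crawl host through DNS (the hostname is a placeholder; this is just a diagnostic sketch, not part of the connector):

    import java.net.InetAddress;

    // Resolve a crawl host from the Aspire server; a failure here points to DNS.
    public class DnsCheck {
        public static void main(String[] args) throws Exception {
            System.out.println(InetAddress.getByName("www.example.com").getHostAddress());
        }
    }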

Where can I learn more about Heritrix architecture and configuration?

...

  • The Heritrix Connector is only in charge of:
    • Preparing the crawler-beans.cxml file with the user's parameters (a minimal example appears after this list).
    • Creating a Heritrix Engine job and starting the crawl.
    • Receiving the crawled web pages or documents from the Heritrix Crawl Engine and sending them to an Aspire pipeline.
    • Managing incremental indexing (ignoring unchanged documents, sending new documents, and deleting the ones that are no longer accessible) based on the documents received from the Heritrix Crawl Engine.
    • Cleaning up the content of each document using the Cleanup Regex.
    • Applying the Index include/exclude patterns to the documents received.
  • The Heritrix Crawl Engine is in charge of:
    • Actually performing the crawl.
    • Checking robots policies.
    • Performing authentication (including NTLM in our custom engine) if specified and required.
    • Applying XSLT transformations (in our custom engine).
    • Fetching an input stream for each document.
    • Calculating an MD5 digest of the content, used later by the Aspire Heritrix Connector for incremental indexing.
    • Applying the Crawl patterns specified by the user.
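
To make the first connector task concrete, here is a minimal fragment of the kind of crawler-beans.cxml content the connector prepares. The bean layout follows the stock Heritrix 3 profile (a TextSeedModule fed by a ConfigString); the seed URL is a placeholder:

    <!-- Seed list: the connector fills this in from the user's start URLs -->
    <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
      <property name="textSource">
        <bean class="org.archive.spring.ConfigString">
          <property name="value">
            <value>
    http://example.com/
            </value>
          </property>
        </bean>
      </property>
    </bean>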

...

More information on crawler beans configuration is available at https://webarchive.jira.com/wiki/display/Heritrix/Basic+Crawl+Job+Settings.
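
As a hedged illustration, basic job settings are typically adjusted through the simpleOverrides bean of the stock Heritrix 3 profile; metadata.jobName, metadata.operatorContactUrl, and crawlController.maxToeThreads are properties of the stock beans, but the values below are placeholders:

    <bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
      <property name="properties">
        <value>
    # Java properties format: beanName.property=value
    metadata.jobName=my-crawl
    metadata.operatorContactUrl=http://example.com/contact
    crawlController.maxToeThreads=25
        </value>
      </property>
    </bean>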

Why does an incremental crawl last as long as a full crawl?

The Heritrix Connector performs incremental crawls based on a disk-backed HashMap that maps each document the connector has indexed to the search engine to a content digest signature. On an incremental crawl the connector fully crawls the configured web sites exactly as in a full crawl, but it only indexes the documents that were added, modified, or deleted during that crawl.
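
As a minimal sketch of that decision logic (an ordinary in-memory HashMap stands in for the connector's disk-backed map, and the class and method names are illustrative, not Aspire's actual API):

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative stand-in for the connector's disk-backed URL -> digest map.
    public class IncrementalIndex {
        private final Map<String, String> urlToDigest = new HashMap<>();
        private final Set<String> seenThisCrawl = new HashSet<>();

        // True if the page is new or its MD5 digest changed, i.e. it must be (re)indexed.
        public boolean shouldIndex(String url, byte[] content) throws Exception {
            String digest = new BigInteger(1,
                    MessageDigest.getInstance("MD5").digest(content)).toString(16);
            seenThisCrawl.add(url);
            String previous = urlToDigest.put(url, digest);
            return previous == null || !previous.equals(digest);
        }

        // After the crawl: any previously indexed URL not seen again is gone -> delete it.
        public Set<String> toDelete() {
            Set<String> gone = new HashSet<>(urlToDigest.keySet());
            gone.removeAll(seenThisCrawl);
            return gone;
        }
    }

Note that every page still has to be fetched and digested before the comparison can run, which is why the crawl itself takes roughly as long as a full crawl; only the indexing work is saved.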

...

Heritrix by default extracts a maximum of 6,000 links from a single URL; the rest of the links found are discarded and therefore not crawled. You can change that limit by overriding a bean in a custom Heritrix crawler-beans file (see the example below). See more information on how to configure this at Using a Custom Heritrix Configuration File.
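
For example, assuming the stock Heritrix 3 profile, where the HTML extractor bean is named extractorHtml and the 6,000 default lives in its maxOutlinks property, the limit can be raised by editing that bean in your custom crawler-beans file; the value 10000 is just an illustration:

    <!-- Raise the per-page outlink limit (default 6000) on the HTML extractor -->
    <bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
      <property name="maxOutlinks" value="10000" />
    </bean>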

...

To request new features or for more information, please contact us at http://www.searchtechnologies.com/contacts.html.