After a few minutes, there are no updates submitted in the content source statistics

Where can I learn more about Heritrix architecture and configuration?

Where should I look if my Heritrix Connector is not working as expected?

A common source of confusion when using the Aspire Heritrix Connector is working out where an issue should be fixed: is it an Aspire connector issue, or a Heritrix crawl engine issue?

You can identify which part is at fault by looking at what each part does:

My crawl is very slow

Keep in mind that Heritrix always tries to protect the servers it crawls by throttling requests to the same hostname, bounded by the maxDelayMs setting.

Try a lower millisecondsPerRequest if you are sending the configuration via an Aspire job, or lower the value of the following property:

<bean class="org.archive.crawler.postprocessor.DispositionProcessor" id="disposition">
  <property name="maxDelayMs" value="3000"/>
</bean>
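
As a rough illustration of why a lower maxDelayMs shortens the wait between requests to the same host, the sketch below assumes the wait is the previous fetch duration multiplied by a factor and then clamped between a minimum and a maximum. The class and method names are hypothetical, the delayFactor and minDelayMs parameters are assumptions for the example, and none of this is taken from the Heritrix source.

```java
// Hypothetical sketch, not Heritrix source code: models a per-host politeness
// delay as a multiple of the last fetch duration, clamped to [minDelayMs, maxDelayMs].
final class PolitenessDelaySketch {

    /**
     * @param lastFetchMs how long the previous fetch from this host took
     * @param delayFactor multiplier applied to the fetch duration (assumed setting)
     * @param minDelayMs  lower bound on the wait (assumed setting)
     * @param maxDelayMs  upper bound on the wait, as in the maxDelayMs property above
     */
    static long delayMs(long lastFetchMs, double delayFactor, long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(lastFetchMs * delayFactor);
        // Apply the upper bound first, then make sure the lower bound still holds.
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }

    public static void main(String[] args) {
        // A server that takes 2 s per fetch, with an assumed factor of 5.
        System.out.println(delayMs(2000, 5.0, 1000, 30000)); // cap of 30 s is not reached
        System.out.println(delayMs(2000, 5.0, 1000, 3000));  // cap of 3 s limits the wait
    }
}
```

Under these assumptions, a slow server drives the wait up until the cap is reached, so the cap is the setting that matters most when response times are long.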

You can also increase the number of parallel connections to the same hostname. For details, see the "Configuring Concurrent Connections to the same hostname" section of Using a Custom Heritrix Configuration File.

For more crawler bean settings, see https://webarchive.jira.com/wiki/display/Heritrix/Basic+Crawl+Job+Settings .

Why does an incremental crawl last as long as a full crawl?

The Heritrix Connector performs incremental crawls using a disk-backed HashMap, which records each document the connector has indexed to the search engine together with its content digest signature. During an incremental crawl, the connector crawls the configured web sites in full, exactly as it does for a full crawl, but it only indexes the documents found to be new, modified, or deleted during that crawl. The crawl time is therefore similar; only the amount of indexing is reduced.
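
The comparison described above can be sketched as follows. This is a minimal illustration and not the connector's code: an in-memory map stands in for the disk-backed HashMap, and the class name, the Action enum, and the sample URLs and digests are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the connector's code: each map holds URL -> content digest.
// "previous" stands in for the snapshot kept from the last crawl.
final class IncrementalCrawlSketch {

    enum Action { ADD, UPDATE, DELETE }

    /** Compares the previous snapshot with the freshly crawled one and returns what to index. */
    static Map<String, Action> diff(Map<String, String> previous, Map<String, String> current) {
        Map<String, Action> actions = new TreeMap<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldDigest = previous.get(entry.getKey());
            if (oldDigest == null) {
                actions.put(entry.getKey(), Action.ADD);
            } else if (!oldDigest.equals(entry.getValue())) {
                actions.put(entry.getKey(), Action.UPDATE);
            }
            // Same digest: the page was still fetched, but nothing is sent to the index.
        }
        for (String url : previous.keySet()) {
            if (!current.containsKey(url)) {
                actions.put(url, Action.DELETE);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        Map<String, String> previous = Map.of(
                "http://example.com/a", "d1",
                "http://example.com/b", "d2",
                "http://example.com/c", "d3");
        Map<String, String> current = Map.of(
                "http://example.com/a", "d1",
                "http://example.com/b", "d2-changed",
                "http://example.com/d", "d4");
        System.out.println(diff(previous, current));
    }
}
```

Note that the current map can only be built by fetching every page again, which is why the crawl itself is no shorter than a full crawl.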

Why am I only getting 6000 documents discovered per URL?

By default, Heritrix extracts at most 6000 links from a single URL; any further links found are discarded and therefore not crawled. You can change this limit by editing a bean in a custom Heritrix crawler beans file. For details, see Using a Custom Heritrix Configuration File.

More information about Heritrix Connector

For more information on configuring the crawler beans for custom features see Using a Custom Heritrix Configuration File.

For the general Heritrix FAQ, go to http://crawler.archive.org/faq.html .

If you are interested in developing new features in Heritrix, go to http://crawler.archive.org/articles/developer_manual/index.html .

To request new features or more information please contact us at this email address.