One of the most common causes of confusion when using the Aspire Heritrix Connector is working out where an issue should be fixed: is it an Aspire connector issue, or a Heritrix crawl engine issue?
You can tell which part is responsible by looking at what each part does:
Keep in mind that Heritrix always tries to protect the servers it is crawling by throttling requests to the same hostname, waiting up to its configured maximum delay (maxDelayMs) between them.
Try a lower millisecondsPerRequest value if you are sending the configuration via an Aspire job, or lower the following property:
<bean class="org.archive.crawler.postprocessor.DispositionProcessor" id="disposition">
  <property name="maxDelayMs" value="3000"/>
</bean>
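To see why lowering the maximum delay speeds up a crawl, here is a minimal sketch of a politeness delay calculation. It assumes the wait before the next request to the same host is the duration of the previous fetch multiplied by a factor, then clamped between a minimum and a maximum delay. The class name, method name and sample values are illustrative only and are not taken from the Heritrix or Aspire source code.

```java
// Illustrative sketch only: a simplified politeness delay, not Heritrix code.
final class PolitenessDelaySketch {

    /**
     * Returns how long to wait (in ms) before the next request to the same host.
     * The parameter names are assumptions chosen to mirror the configuration
     * discussed above.
     */
    static long politenessDelayMs(long fetchDurationMs, double delayFactor,
                                  long minDelayMs, long maxDelayMs) {
        // Slow servers get proportionally longer pauses.
        long wait = (long) (delayFactor * fetchDurationMs);
        // Never wait less than the minimum...
        if (wait < minDelayMs) {
            wait = minDelayMs;
        }
        // ...and never more than the maximum, which is the value lowered above.
        if (wait > maxDelayMs) {
            wait = maxDelayMs;
        }
        return wait;
    }

    public static void main(String[] args) {
        // A 2-second fetch with a maximum delay of 30 seconds.
        System.out.println(politenessDelayMs(2000, 5.0, 1000, 30000));
        // The same fetch with the maximum delay lowered to 3 seconds.
        System.out.println(politenessDelayMs(2000, 5.0, 1000, 3000));
    }
}
```

With these sample values, the first call waits 10000 ms and the second is capped at 3000 ms, so the same slow server is revisited more than three times as often.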
You can also increase the number of parallel connections to the same hostname. For more information, see the Configuring Concurrent Connections to the same hostname section of Using a Custom Heritrix Configuration File.
More crawler bean configuration options are described at: https://webarchive.jira.com/wiki/display/Heritrix/Basic+Crawl+Job+Settings .
The Heritrix Connector performs incremental crawls based on a disk-backed HashMap, which holds the exact documents that the connector has indexed to the search engine, each associated with a content digest signature. On an incremental crawl the connector fully crawls the configured web sites, the same way as in a full crawl, but it only indexes the documents that are new, modified or deleted since the previous crawl.
By default, Heritrix extracts a maximum of 6000 links from a single URL; any further links found are discarded and therefore not crawled. You can change this limit by editing a bean in a custom Heritrix crawler beans file. For details on how to configure it, see Using a Custom Heritrix Configuration File.
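The incremental behaviour described above can be sketched as follows. This is a simplified illustration, not the connector's actual implementation: an in-memory HashMap stands in for the disk-backed map, SHA-1 is assumed as the digest algorithm, and the class, enum and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: digest-based change detection for incremental crawls.
final class IncrementalCrawlSketch {

    enum Action { ADD, UPDATE, UNCHANGED }

    // URL -> digest of the content as last indexed (stand-in for the disk-backed map).
    private final Map<String, String> indexed = new HashMap<>();
    // URLs encountered during the crawl currently in progress.
    private final Set<String> seenThisCrawl = new HashSet<>();

    // Computes a hex-encoded SHA-1 digest of the document content.
    static String digest(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    // Called for every document the crawler fetches; decides whether to index it.
    Action process(String url, String content) {
        seenThisCrawl.add(url);
        String newDigest = digest(content);
        String oldDigest = indexed.put(url, newDigest);
        if (oldDigest == null) {
            return Action.ADD;
        }
        return oldDigest.equals(newDigest) ? Action.UNCHANGED : Action.UPDATE;
    }

    // Called when the crawl ends; returns URLs that were indexed before but not seen now.
    Set<String> finishCrawl() {
        Set<String> deleted = new HashSet<>(indexed.keySet());
        deleted.removeAll(seenThisCrawl);
        indexed.keySet().removeAll(deleted);
        seenThisCrawl.clear();
        return deleted;
    }
}
```

Every URL is still fetched on each crawl, which is why an incremental crawl takes about as long as a full one on the crawling side; only the ADD, UPDATE and delete outcomes would result in work for the search engine.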
For more information on configuring the crawler beans for custom features, see Using a Custom Heritrix Configuration File.
For the general Heritrix FAQ, go to http://crawler.archive.org/faq.html .
If you are interested in developing new features in Heritrix, go to http://crawler.archive.org/articles/developer_manual/index.html .
To request new features or more information, please contact us at http://www.searchtechnologies.com/contacts.html .