The Heritrix Connector
Assume that the Heritrix engine computes a unique digest for each URL crawled, so the following bean must be configured:
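The bean itself is not reproduced on this page. As a minimal sketch, assuming the digest is configured on the standard FetchHTTP fetcher via its digestContent and digestAlgorithm properties (the bean id "fetchHttp" follows the usual crawler-beans convention), it might look like:

```xml
<!-- Sketch: compute a per-URL content digest during fetching.
     digestContent / digestAlgorithm are standard FetchHTTP properties;
     confirm the exact values against your Aspire distribution. -->
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <property name="digestContent" value="true"/>
  <property name="digestAlgorithm" value="sha1"/>
</bean>
```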
This approach works for both HTTPS and HTTP. In this example we are trying to crawl https://myAuthenticatedSite.com/. The configuration in the crawler beans would be as follows:
Remember to customize the fetchHttp as explained at the bottom.
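The configuration snippet is not shown here. A sketch of an HTTP basic-auth entry in the Heritrix credential store, assuming the stock org.archive.modules.credential classes (mylogin/mypassword are placeholders), could look like:

```xml
<!-- Sketch: register a basic-auth credential for the crawled host.
     The realm value must exactly match the server's WWW-Authenticate
     challenge (see the Realm section below). -->
<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="myAuthenticatedSiteCredential">
        <bean class="org.archive.modules.credential.HttpAuthenticationCredential">
          <property name="domain" value="myAuthenticatedSite.com"/>
          <property name="realm" value="My Authenticated Site Web Browsing"/>
          <property name="login" value="mylogin"/>
          <property name="password" value="mypassword"/>
        </bean>
      </entry>
    </map>
  </property>
</bean>
```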
Realm: The realm string must exactly match the realm name presented in the authentication challenge served by the web server. You can obtain it by forcing a 401 response from the server. Using curl: curl -ik --user mylogin https://myAuthenticatedSite.com/ The console will prompt for a password; enter an incorrect one and you will receive a response very similar to this:
The realm is the value in double quotes on this line: WWW-Authenticate: Basic realm="My Authenticated Site Web Browsing"
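For the example site above, the relevant part of the 401 response might look like the following (illustrative headers only; only the WWW-Authenticate line matters here):

```
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="My Authenticated Site Web Browsing"
Content-Type: text/html; charset=iso-8859-1
```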
The Aspire Heritrix Connector uses a custom Heritrix engine that has been extended to handle NTLM authentication.
Example bean configuration
The realm values can be retrieved from a curl command execution. See previous section for more information.
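The bean configuration itself is not reproduced here. Purely as an illustration, an NTLM credential entry would likely follow the same shape as the basic-auth case; the class name below is hypothetical and must be taken from the Aspire distribution:

```xml
<!-- Illustrative only: the NTLM credential class ships with the custom
     Aspire Heritrix engine; "NtlmCredential" is a hypothetical name. -->
<entry key="myNtlmCredential">
  <bean class="org.archive.modules.credential.NtlmCredential">
    <property name="domain" value="myAuthenticatedSite.com"/>
    <property name="realm" value="My Authenticated Site Web Browsing"/>
    <property name="login" value="mylogin"/>
    <property name="password" value="mypassword"/>
  </bean>
</entry>
```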
HtmlFormCredential reconnection on expired cookies
Some servers emit cookies with an expiration time, so if you want to force a reconnection you can do one of two things:
- Force a reconnection after a period of time (expireAfter)
- Force a reconnection for a URL when its content contains a regex (expiredContent)
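The two options above can be sketched on the HtmlFormCredential bean. The expireAfter and expiredContent property names come from this connector's custom engine; the login URL, form field names, and value formats shown here are assumptions:

```xml
<!-- Sketch: form-login credential with forced reconnection.
     loginUri and formItems values are placeholders. -->
<bean class="org.archive.modules.credential.HtmlFormCredential">
  <property name="domain" value="myAuthenticatedSite.com"/>
  <property name="loginUri" value="https://myAuthenticatedSite.com/login"/>
  <property name="formItems">
    <map>
      <entry key="username" value="mylogin"/>
      <entry key="password" value="mypassword"/>
    </map>
  </property>
  <!-- option 1: reconnect after a period of time (unit is an assumption) -->
  <property name="expireAfter" value="3600"/>
  <!-- option 2: reconnect when the fetched content matches this regex -->
  <property name="expiredContent" value=".*Session expired.*"/>
</bean>
```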
Enable XSL Transformation
Enable XSLT in Heritrix to let the engine extract links and content from the HTML generated by the XSL transformation. The Aspire Heritrix Connector will also use the generated HTML to extract the data to be indexed.
To enable XSL transformations add the following property to the FetchHTTP bean
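The property is not shown on this page; based on the enableXslt name mentioned below, it would presumably look like this inside the fetchHttp bean:

```xml
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- enable XSL transformation of XML documents before link extraction -->
  <property name="enableXslt" value="true"/>
  <!-- other fetchHttp properties unchanged -->
</bean>
```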
By default, if the enableXslt property is not present, Heritrix will not perform any XSL transformation and will process every XML document as-is.
Crawling Large Web Pages
By default, Heritrix extracts a maximum of 6000 links from a single URL; any further links found are discarded and therefore not crawled. If your site contains pages with more than 6000 links per URL, you will want to raise the maximum number of links to extract. To do so, add the following property to the org.archive.crawler.frontier.BdbFrontier bean:
so your bean would look like this:
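The property name is not given on this page; assuming a maxOutlinks-style property on the frontier bean (the name below is a guess — check it against the Aspire Heritrix distribution), the bean might look like:

```xml
<!-- Sketch: raise the per-URL outlink limit above the 6000 default.
     "maxOutlinks" is an assumed property name. -->
<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
  <property name="maxOutlinks" value="20000"/>
  <!-- other frontier properties unchanged -->
</bean>
```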
If you are crawling a web site with long paginated lists and a "Next Page" link, you may also want to increase the maximum number of hops. For example, if your web site consists of the following chain of list pages: page1 -> page2 -> ... -> page100, and your max hops is configured as 5, the crawl would only reach the 6th page, so you would want to increase it to at least 100.
You can configure the max hops like this:
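The snippet is not shown here. In stock Heritrix the hop limit lives on the scope's TooManyHopsDecideRule (default 20), so a sketch might be:

```xml
<!-- Sketch: allow chains of up to 100 hops so deep "Next Page"
     sequences are followed. -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <property name="maxHops" value="100"/>
</bean>
```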
Configuring Concurrent Connections to the same hostname
By default Heritrix uses only one queue per hostname, so only one connection retrieves data from a given web server at a time. This makes sense for very wide web crawls, but not if you want to crawl a single web site. You can use the Heritrix parallelQueues setting to have more than one concurrent connection retrieving data.
There are several ways to distribute the URLs among the parallel queues:
- This will group the URLs based on their IP addresses.
- This will group the URLs based on the hostname:port evident in the URL.
- This will group the URLs based on their IP addresses; if no IP address is available, it behaves like the HostnameQueueAssignmentPolicy.
- This will group the URLs evenly based on this formula: urlHash % parallelQueuesSize.
The maxRetries property must be raised if you increase the number of parallelQueues, as connections sometimes fail at the beginning of the crawl.
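A combined sketch of the two settings above, assuming the hostname-based policy. Where parallelQueues lives (policy bean vs. frontier bean) and the value formats are assumptions to verify against the Aspire distribution:

```xml
<!-- Sketch: 5 parallel queues per hostname plus a raised retry count. -->
<bean id="queueAssignmentPolicy"
      class="org.archive.crawler.frontier.HostnameQueueAssignmentPolicy">
  <!-- number of concurrent queues (and thus connections) per hostname -->
  <property name="parallelQueues" value="5"/>
</bean>

<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
  <property name="queueAssignmentPolicy" ref="queueAssignmentPolicy"/>
  <!-- raise retries since connections sometimes fail early in the crawl -->
  <property name="maxRetries" value="60"/>
</bean>
```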
Be cautious about how many connections you use per hostname: too many can cause problems for the crawled web site and may be considered an attack by its administrator.