<bean id="aspireProcessor" class="com.searchtechnologies.aspire.components.heritrixconnector.AspireHeritrixProcessor"/>

<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <ref bean="aspireProcessor"/>
      <ref bean="candidates"/>
      <ref bean="disposition"/>
    </list>
  </property>
</bean>
The Heritrix Connector requires the Heritrix Engine to compute a unique digest for each crawled URL, so the following bean must be configured:
<bean class="org.archive.modules.fetcher.FetchHTTP" id="fetchHttp">
  <property name="digestContent" value="true" />
  <property name="digestAlgorithm" value="md5" />
</bean>
This approach works for both HTTP and HTTPS. In this example we are crawling https://myAuthenticatedSite.com/. The configuration in the crawler beans would be as follows:
<bean id="credential" class="org.archive.modules.credential.HttpAuthenticationCredential">
  <property name="domain">
    <value>myAuthenticatedSite.com:443</value>
  </property>
  <property name="realm">
    <value>My Authenticated Site Web Browsing</value>
  </property>
  <property name="login">
    <value>mylogin</value>
  </property>
  <property name="password">
    <value>myPassword</value>
  </property>
  <property name="usingNtlm">
    <value>false</value>
  </property>
</bean>

<!-- CREDENTIAL STORE: HTTP authentication or FORM POST credentials -->
<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="example-credential" value-ref="credential"/>
    </map>
  </property>
</bean>
Remember to customize the fetchHttp bean as explained below.
Realm: The realm string must exactly match the realm name presented in the authentication challenge served by the web server. You can obtain it by forcing a 401 response from the server, for example with curl:

curl -ik --user mylogin https://myAuthenticatedSite.com/

The console will prompt for a password; enter an incorrect one and you will get a response similar to this:
$ curl -ik --user mylogin https://myAuthenticatedSite.com/
Enter host password for user 'mylogin':
HTTP/1.1 401 Authorization Required
Date: Mon, 03 Jun 2013 19:36:05 GMT
Server: Apache/2.2.16 (Debian)
WWW-Authenticate: Basic realm="My Authenticated Site Web Browsing"
Vary: Accept-Encoding
Content-Length: 501
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>401 Authorization Required</title>
</head><body>
<h1>Authorization Required</h1>
<p>This server could not verify that you are authorized to access the document requested. Either you supplied the wrong credentials (e.g., bad password), or your browser doesn't understand how to supply the credentials required.</p>
<hr>
<address>Apache/2.2.16 (Debian) Server at myAuthenticatedSite.com Port 443</address>
</body></html>
The realm is the string in double quotes on this line: WWW-Authenticate: Basic realm="My Authenticated Site Web Browsing"
The Aspire Heritrix Connector uses a custom Heritrix Engine that was extended to handle NTLM authentication.
Example bean configuration
<bean id="credential" class="org.archive.modules.credential.HttpAuthenticationCredential">
  <property name="domain">
    <value>myNtlmSite.domain</value> <!-- if your site uses https, your domain will look like this: myNtlmSite.domain:443 -->
  </property>
  <property name="realm">
    <value>My Authenticated Site Web Browsing</value> <!-- if no realm is used by NTLM, leave empty -->
  </property>
  <property name="ntlmDomain">
    <value>myNtlmDomain</value> <!-- if NTLM requires users to specify a domain; otherwise leave empty -->
  </property>
  <property name="ntlmRealm">
    <value>My Authenticated Site Web Browsing</value> <!-- if no realm is used by NTLM, leave empty -->
  </property>
  <property name="login">
    <value>myUsername</value>
  </property>
  <property name="password">
    <value>myPassword</value>
  </property>
  <property name="usingNtlm">
    <value>true</value>
  </property>
</bean>
The realm values can be retrieved with a curl command; see the previous section for more information.
<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="example-credential" value-ref="credential"/>
    </map>
  </property>
</bean>

<bean class="org.archive.modules.fetcher.FetchHTTP" id="fetchHttp">
  <property name="credentialStore">
    <ref bean="credentialStore"/>
  </property>
  <property name="digestContent" value="true" />
  <property name="digestAlgorithm" value="md5" />
  <property name="sendConnectionClose" value="false" />
</bean>
Sometimes servers emit cookies with an expiration time. If you want to force a reconnection, you can set an expiration period on the credential (expireAfter) and/or specify content that identifies an expired session (expiredContent).
Example:
<bean id="credential" class="org.archive.modules.credential.HtmlFormCredential">
  <property name="expireAfter" value="1500"/>
  <property name="expiredContent" value="window.location = &quot;signup.php&quot;" />
</bean>
Enable XSLT in Heritrix to allow the engine to extract links and content from the XSLT-generated HTML. The Aspire Heritrix Connector will also use the generated HTML to extract the data to be indexed.
To enable XSL transformations, add the following property to the FetchHTTP bean:
<bean class="org.archive.modules.fetcher.FetchHTTP" id="fetchHttp">
  <property name="enableXslt" value="true" />
</bean>
If the enableXslt property is not present, Heritrix will not perform any XSL transformation by default and will process every XML document as-is.
By default, Heritrix extracts a maximum of 6000 links from a single URL; the rest of the links found are discarded and therefore not crawled. If your site contains pages with more than 6000 links per URL, you may want to increase this limit. To do so, add the following property to the org.archive.crawler.frontier.BdbFrontier bean:
<property name="maxOutlinks" value="10000"/>
so your bean would look like this:
<bean class="org.archive.crawler.frontier.BdbFrontier" id="frontier">
  <property name="retryDelaySeconds" value="20"/>
  <property name="maxRetries" value="5"/>
  <property name="maxOutlinks" value="10000"/>
</bean>
If you are crawling a web site with long lists of links and a "Next Page" link, you may also want to increase the maximum number of hops. For example, if your web site consists of the following chain of list pages: page1 -> page2 -> ... -> page100, and your max hops is configured as 5, the crawl would only reach the 6th page, so you should increase it to at least 100.
You can configure the max hops like this:
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <property name="maxHops" value="100"/>
</bean>
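In a typical crawler-beans configuration this rule does not stand alone; it is evaluated as part of the scope's decision rule sequence. The following is a sketch of how it might be wired in (the surrounding bean id and the other rules shown are assumptions based on a standard Heritrix 3 scope, not taken from this document):

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- the usual accept/reject scope rules would appear here -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="100"/>
      </bean>
    </list>
  </property>
</bean>
```

If your crawler-beans file already defines a scope sequence, only the maxHops value of the existing TooManyHopsDecideRule needs to change.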
By default, Heritrix uses only one queue per hostname, so only one connection retrieves data from a given web server at a time. This makes sense for very wide web crawls, but not if you want to crawl a single web site. You can use Heritrix parallel queues to have more than one concurrent connection retrieving data.
There are several ways to distribute the URLs among the parallel queues:
<bean class="org.archive.crawler.frontier.BdbFrontier" id="frontier">
  <property name="retryDelaySeconds" value="2"/>
  <property name="maxRetries" value="10"/>
</bean>

<bean id="queueAssignmentPolicy" class="org.archive.crawler.frontier.HashingQueueAssignmentPolicy">
  <property name="parallelQueues" value="50" />
</bean>
The maxRetries property should be raised when you increase the number of parallelQueues, as requests sometimes fail at the beginning of the crawl.
Be cautious about how many connections you use per hostname: too many concurrent connections can cause problems for the crawled web site and may be considered an attack by its administrator.
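As a starting point for politeness tuning, the per-queue crawl delay can be adjusted on Heritrix's disposition processor. A sketch, assuming the standard org.archive.crawler.postprocessor.DispositionProcessor bean found in a default crawler-beans file (the values shown are illustrative, not recommendations from this document):

```xml
<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- wait delayFactor times the duration of the last fetch before hitting the same queue again -->
  <property name="delayFactor" value="5.0"/>
  <!-- never wait less than 3 s or more than 30 s between requests to the same queue -->
  <property name="minDelayMs" value="3000"/>
  <property name="maxDelayMs" value="30000"/>
</bean>
```

Raising parallelQueues while lowering these delays multiplies the load on the target server, so change them together with care.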