If you are migrating from Heritrix to Aspider, there are some general steps you should follow to cover all major configuration differences between the two crawlers.
The sections below follow the Aspider configuration options in order. For each option, they explain how to determine what to configure based on the type of Heritrix configuration you have.
This section is under construction
Seed URLs
Copy seed URLs into Aspider
Crawl Scope
If the crawler beans contain an org.archive.modules.deciderules.surt.NotOnDomainsDecideRule bean
Set this option to Domain only
If they contain an org.archive.modules.deciderules.surt.NotOnHostsDecideRule bean instead
Set this option to Host Only
If neither of the previous beans is present
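The checks above can be sketched as a small script. This is a minimal illustration, not part of Aspider or Heritrix: the function name `suggest_crawl_scope` and the sample XML are hypothetical, and the returned labels simply repeat the option names used in this section. When neither bean is found the function returns `None`, because the text above does not state which value to choose in that case.

```python
import xml.etree.ElementTree as ET

# Class names come from the text above; the labels mirror the Aspider options.
SCOPE_BY_CLASS = {
    "org.archive.modules.deciderules.surt.NotOnDomainsDecideRule": "Domain only",
    "org.archive.modules.deciderules.surt.NotOnHostsDecideRule": "Host Only",
}


def suggest_crawl_scope(beans_xml):
    """Return the Crawl Scope implied by the decide rules, or None if neither bean is present."""
    root = ET.fromstring(beans_xml)
    # Tags may carry a namespace prefix such as '{http://...}bean'; compare the local name only.
    classes = {
        el.get("class")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "bean"
    }
    for class_name, scope in SCOPE_BY_CLASS.items():
        if class_name in classes:
            return scope
    return None


# Hypothetical, heavily trimmed crawler-beans content used only for illustration.
SAMPLE = """<beans xmlns="http://www.springframework.org/schema/beans">
  <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
      <list>
        <bean class="org.archive.modules.deciderules.surt.NotOnHostsDecideRule"/>
      </list>
    </property>
  </bean>
</beans>"""

print(suggest_crawl_scope(SAMPLE))
```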
User Agent
If any userAgentTemplate property is set in the crawler beans
Copy the user-agent into this option
Crawl Depth
This is defined as a bean in the crawler beans:
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="HOPS-NUMBER" />
    </bean>
Copy the "HOPS-NUMBER" into this option
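If the crawler-beans file is large, the value can also be looked up with a short script. The sketch below is only an illustration: the function name `find_max_hops` and the sample value `20` are made up, and it assumes the value is given as a literal number in the `value` attribute, as in the bean shown above.

```python
import xml.etree.ElementTree as ET

HOPS_RULE = "org.archive.modules.deciderules.TooManyHopsDecideRule"


def find_max_hops(beans_xml):
    """Return maxHops from the first TooManyHopsDecideRule bean, or None if absent."""
    root = ET.fromstring(beans_xml)
    for bean in root.iter():
        # Tags may carry a namespace prefix such as '{http://...}bean'; compare the local name only.
        if bean.tag.rsplit("}", 1)[-1] != "bean" or bean.get("class") != HOPS_RULE:
            continue
        for prop in bean:
            if prop.get("name") == "maxHops" and prop.get("value") is not None:
                # Assumes a literal integer; a placeholder value would raise ValueError here.
                return int(prop.get("value"))
    return None


# Hypothetical sample; 20 is an arbitrary illustrative value.
SAMPLE = """<beans xmlns="http://www.springframework.org/schema/beans">
  <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
    <property name="maxHops" value="20" />
  </bean>
</beans>"""

print(find_max_hops(SAMPLE))
```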
Max Links per page
Find the maxOutLinks property in the crawler beans and copy its value into this option
Use Authentication
If your site requires authentication, the first step is to identify the type of authentication to use. This can be found in the crawler-beans file by looking for the bean with the id "credential". There can be two different beans:
HtmlFormCredential
For the Form Element Path field, inspect the structure of your login page to determine where the form can be found. For example, if your login page HTML looks like this:
    <html>
    <head>..</head>
    <body>
      <div id="content">
        <form id="login" method="POST" action="login.php">
          <label><b>Username</b></label>
          <input type="text" placeholder="Enter Username" name="uname" required>
          <label><b>Password</b></label>
          <input type="password" placeholder="Enter Password" name="psw" required>
          <input type="hidden" name="clientToken" value="wAAAMLCwkJCQgAAAGJiYoKCgpKSkiH">
          <button type="submit">Login</button>
        </form>
      </div>
    </body>
    </html>
your Form Element Path should look like:
/html/body/div[@id="content"]/form[@id="login"]
If you are using version 3.1.0.6 or later, a CSS selector should be used instead, and it should look like:
#login
HttpAuthenticationCredential
There can be different types of authentication, such as BASIC, DIGEST or NTLM. These authentication mechanisms work at the connection level: the server returns a 401 status code, challenging the client for credentials. The crawler beans properties map to the Aspider fields as follows:
| Crawler beans property | Aspider field |
|---|---|
| domain | host |
| realm | realm |
| login | domain + user |
| password | password |
Include patterns
Find the "MatchesListRegexDecideRule" beans whose "decision" property is "ACCEPT"
Copy the regex patterns into the "Include Patterns" section in Aspider
Any URL that does not match with any pattern in this list will be EXCLUDED
Any pattern set here will override the "Crawl Scope"
For example, adding .* is equivalent to setting Crawl Scope to "Everything"
If the patterns here do not match the seed URLs, the crawler will not be able to fetch anything, so make sure your seed URLs are covered
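Before saving the include patterns, it may help to check that every seed URL is covered by at least one of them. The sketch below is a rough pre-check under stated assumptions: the function name `uncovered_seeds` and the example URLs are hypothetical, patterns are assumed to match the whole URL (as the `.*`-wrapped Heritrix patterns suggest), and Python's regex dialect is assumed to be close enough to the crawler's for simple patterns.

```python
import re


def uncovered_seeds(seed_urls, include_patterns):
    """Return the seed URLs that match none of the include patterns."""
    if not include_patterns:
        # Assumption: with no include patterns, the Crawl Scope alone decides.
        return []
    compiled = [re.compile(p) for p in include_patterns]
    return [
        url
        for url in seed_urls
        if not any(rx.fullmatch(url) for rx in compiled)
    ]


# Hypothetical seeds and patterns used only for illustration.
seeds = ["https://example.com/docs/", "https://example.com/blog/"]
patterns = [r"https://example\.com/docs/.*"]

print(uncovered_seeds(seeds, patterns))  # ['https://example.com/blog/']
```

Any URL printed by this check would be excluded by the include list, so either widen a pattern or remove the seed.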
Exclude patterns
Find the "MatchesListRegexDecideRule" beans whose "decision" property is "REJECT"
Copy the regex patterns into the "Exclude patterns" section in Aspider
Any URL matching a pattern in this list will NOT be processed in the workflow so it will NOT be indexed.
These rules will also prevent the crawler from discovering links in any URL that matches a pattern in this list (unless "Scan Excluded Items" is checked).
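Collecting the ACCEPT and REJECT patterns by hand can be error-prone when the crawler-beans file contains several rules. The following sketch groups the regex lists by their "decision" value so they can be pasted into the corresponding Aspider sections. The function name `patterns_by_decision` and the sample XML are hypothetical, and the script assumes the bean layout shown in this document (a `decision` property with a `value` attribute and a `regexList` property containing `<value>` elements).

```python
import xml.etree.ElementTree as ET

RULE_CLASS = "org.archive.modules.deciderules.MatchesListRegexDecideRule"


def _local(tag):
    """Strip an XML namespace prefix such as '{http://...}bean'."""
    return tag.rsplit("}", 1)[-1]


def patterns_by_decision(beans_xml):
    """Map each decision (for example 'ACCEPT' or 'REJECT') to its regex patterns."""
    result = {}
    root = ET.fromstring(beans_xml)
    for bean in root.iter():
        if _local(bean.tag) != "bean" or bean.get("class") != RULE_CLASS:
            continue
        decision = None
        patterns = []
        for prop in bean:
            if _local(prop.tag) != "property":
                continue
            if prop.get("name") == "decision":
                decision = prop.get("value")
            elif prop.get("name") == "regexList":
                # Collect the text of every <value> element inside the list.
                patterns = [
                    v.text.strip()
                    for v in prop.iter()
                    if _local(v.tag) == "value" and v.text
                ]
        if decision is not None:
            result.setdefault(decision, []).extend(patterns)
    return result


# Hypothetical, trimmed sample used only for illustration.
SAMPLE = r"""<beans>
  <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="regexList">
      <list>
        <value>.*\.js.*</value>
        <value>.*\.css.*</value>
      </list>
    </property>
  </bean>
</beans>"""

print(patterns_by_decision(SAMPLE))
```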
Scan Excluded Items
This option allows links to be extracted (scanned) from excluded items as well.
If there is a pattern in the Heritrix "Index Exclude Patterns" but not in the crawler beans REJECT MatchesListRegexDecideRule, you may want to check this option
If there is a subset of the reject patterns from which you do not want to extract (scan) links, add those patterns to the "Do not follow patterns" option
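The interaction between the exclude patterns, "Scan Excluded Items" and "Do not follow patterns" can be summarized with a simplified model. This is one reading of the behaviour described above, not Aspider's implementation: the function names, the returned keys and the example URLs are hypothetical, and patterns are assumed to match the whole URL.

```python
import re


def matches_any(url, patterns):
    # Assumption: a pattern must match the entire URL (full match).
    return any(re.fullmatch(p, url) for p in patterns)


def crawl_outcome(url, exclude, do_not_follow=(), scan_excluded_items=False):
    """Model whether a URL is indexed and whether its links are scanned."""
    excluded = matches_any(url, exclude)
    # Excluded items are scanned for links only when the option is checked.
    links_scanned = (not excluded) or scan_excluded_items
    # Assumption: "Do not follow patterns" switches off link extraction for matching URLs.
    if matches_any(url, do_not_follow):
        links_scanned = False
    return {"indexed": not excluded, "links_scanned": links_scanned}


# Hypothetical patterns and URLs used only for illustration.
exclude = [r".*/archive/.*", r".*/tmp/.*"]
do_not_follow = [r".*/tmp/.*"]

for url in (
    "https://example.com/index.html",
    "https://example.com/archive/2019.html",
    "https://example.com/tmp/a.html",
):
    print(url, crawl_outcome(url, exclude, do_not_follow, scan_excluded_items=True))
```

In this model, the archive page is not indexed but its links are still scanned, while the tmp page is neither indexed nor scanned.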
Reject Images / Videos / Javascript / CSS
If you have the following bean in your crawler beans:
    <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="listLogicalOr" value="true"/>
        <property name="regexList">
            <list>
                <value>.*\.js.*</value>
                <value>.*\.css.*</value>
                <value>.*\.swf.*</value>
                <value>.*\.gif.*</value>
                <value>.*\.png.*</value>
                <value>.*\.jpg.*</value>
                <value>.*\.jpeg.*</value>
                <value>.*\.bmp.*</value>
                <value>.*\.mp3.*</value>
                <value>.*\.mp4.*</value>
                <value>.*\.avi.*</value>
                <value>.*\.mpg.*</value>
                <value>.*\.mpeg.*</value>
            </list>
        </property>
    </bean>
Then you should check this option in Aspider
This will exclude any multimedia files from getting processed by the crawler.