If you are migrating from Heritrix to Aspider, there are some general steps you should follow to cover all major configuration differences between the two crawlers.

In each case, the Aspider configuration options are listed in order, with details on how to determine what to configure for each one based on the type of Heritrix configuration you have.


When using standard Aspire configuration for Heritrix

This section is under construction

When using a custom crawler-beans file for Heritrix

  1. Seed URLs

  2. Crawl Scope

      • Set this option to Host Only

  3. User Agent

  4. Crawl Depth

  5. Max Links per page

  6. Max content size (in bytes)
  7. Case Sensitivity URLs
  8. Deletes Policy
  9. Customize Connection Timeouts
  10. Connection Throttling
  11. Obey Robots.txt & Obey Robots Meta Tags
  12. Trust All Certificates
  13. Use Proxy
  14. Use Authentication
    If your site requires authentication, first identify the type of authentication in use. You can find it in the crawler-beans file by looking for the bean with the id "credential". It can be one of two beans:

    1. HtmlFormCredential
      Used for cookie-based authentication, generally using a login page and POST requests to authenticate.
      If this is the authentication used in your crawler-beans file, you should add a "Cookie Based (HTML Forms)" mechanism in Aspider.
      1. In the Login URL field, enter the address of your site's login page.
      2. In the Form Element Path field, enter the path to the login form, which you can determine by inspecting the structure of your login page. For example, if your login page HTML looks like this:

        <html>
        <head>..</head>
        <body>
          <div id="content">
            <form id="login" method="POST" action="login.php">
              <label><b>Username</b></label>
              <input type="text" placeholder="Enter Username" name="uname" required> 
              <label><b>Password</b></label>  
              <input type="password" placeholder="Enter Password" name="psw" required>
              <input type="hidden" name="clientToken" value="wAAAMLCwkJCQgAAAGJiYoKCgpKSkiH">
              <button type="submit">Login</button>
            </form>
          </div>
        </body>
        </html>

        Your Form Element Path should look like:

        /html/body/div[@id="content"]/form[@id="login"]

        If you are using version 3.1.0.6 or later, a CSS selector should be used instead, and it should look like:

        #login
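
As a rough aid for inspecting the login page, the sketch below lists every form in a page together with its id, action and input names, so you can pick the form to reference. It uses only the Python standard library; the FormFinder class and the LOGIN_PAGE sample are illustrative assumptions and are not part of Aspider or Heritrix.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect each <form> with its id, action and input field names."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "id": attrs.get("id"),
                "action": attrs.get("action"),
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append(attrs.get("name"))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Sample login page (assumption: replace with the HTML of your own login page).
LOGIN_PAGE = """
<html>
<body>
  <div id="content">
    <form id="login" method="POST" action="login.php">
      <input type="text" name="uname" required>
      <input type="password" name="psw" required>
      <input type="hidden" name="clientToken" value="wAAAMLCwkJCQgAAAGJiYoKCgpKSkiH">
      <button type="submit">Login</button>
    </form>
  </div>
</body>
</html>
"""

finder = FormFinder()
finder.feed(LOGIN_PAGE)
for form in finder.forms:
    # A form with an id can be addressed with the CSS selector "#<id>".
    selector = f"#{form['id']}" if form["id"] else "form"
    print(selector, form["action"], form["inputs"])
```

If the form has no id, the bare "form" selector printed here is only a starting point and may need to be narrowed by hand.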


    2. HttpAuthenticationCredential
      HTTP authentication comes in different types, such as BASIC, DIGEST or NTLM. These mechanisms work at the connection level: the server returns a 401 status code, challenging the client for credentials. The crawler-beans properties map to Aspider fields as follows:

      Crawler beans property    Aspider field
      domain                    host
      realm                     realm
      login                     domain + user
      password                  password
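
The lookup and mapping described above could be sketched as follows. This is a minimal example using only the Python standard library: it finds the bean with id "credential", reports which credential class it uses and, for HttpAuthenticationCredential, relabels its properties with the Aspider field names from the table. The SAMPLE snippet, its values and the helper names are assumptions for illustration; use the contents of your own crawler-beans file.

```python
import xml.etree.ElementTree as ET

# Spring beans namespace used by crawler-beans files.
NS = {"b": "http://www.springframework.org/schema/beans"}

# Mapping taken from the table above: crawler-beans property -> Aspider field.
HTTP_AUTH_FIELD_MAP = {
    "domain": "host",
    "realm": "realm",
    "login": "domain + user",
    "password": "password",
}

# Hypothetical sample; the values are placeholders, not real settings.
SAMPLE = """<beans xmlns="http://www.springframework.org/schema/beans">
  <bean id="credential"
        class="org.archive.modules.credential.HttpAuthenticationCredential">
    <property name="domain" value="www.example.com:80"/>
    <property name="realm" value="Example Realm"/>
    <property name="login" value="crawler"/>
    <property name="password" value="secret"/>
  </bean>
</beans>"""


def read_credential(beans_xml):
    """Return (short class name, {property: value}) for the "credential" bean."""
    root = ET.fromstring(beans_xml)
    bean = root.find(".//b:bean[@id='credential']", NS)
    if bean is None:
        return None, {}
    props = {}
    for prop in bean.findall("b:property", NS):
        value = prop.get("value")
        if value is None:
            # Spring also allows <property><value>...</value></property>.
            value = prop.findtext("b:value", default="", namespaces=NS)
        props[prop.get("name")] = value.strip()
    return bean.get("class", "").rsplit(".", 1)[-1], props


def to_aspider_fields(props):
    """Relabel crawler-beans properties with the Aspider field names."""
    return {
        HTTP_AUTH_FIELD_MAP[name]: value
        for name, value in props.items()
        if name in HTTP_AUTH_FIELD_MAP
    }


kind, props = read_credential(SAMPLE)
print("Credential type:", kind)
if kind == "HttpAuthenticationCredential":
    for field, value in to_aspider_fields(props).items():
        print(f"{field}: {value}")
```

The script only reports the values; entering them into the Aspider configuration remains a manual step.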


  15. Include patterns

  16. Exclude patterns

  17. Scan Excluded Items

  18. Reject Images / Videos / Javascript / CSS