Crawl Troubleshooting


If you cannot find what you need on this page, please check the general connector troubleshooting page.


How can I trace a crawl?

You can enable the crawl logs by checking the "Log crawled URLs" option in the Connection section.

The crawl logs look like this:

INFO [/<connectorId>-aspider/RAP] 200   890    http://localhost:8000/site1/page1.html    seed    text/html
INFO [/<connectorId>-aspider/RAP] 200   181    http://localhost:8000/site1/pages1/pageC.html    http://localhost:8000/site1/page1.html    text/html

Each entry follows this format:


INFO [/<connectorId>-aspider/RAP] [HTTP Status Code]   [Content Size]    [URL]    [Parent URL]    [Content Type]


Which Authentication Mechanism should be used?

There are two types of authentication: cookie-based authentication and HTTP-based authentication.

  1. Cookie-based sites redirect you to a login page where you log in.
    You are probably facing a cookie-based authenticated site if the first line of the result is a redirect such as "HTTP/1.1 302 Found".

  2. HTTP-based sites prompt you for credentials in the browser without displaying a login page.
    You are probably facing HTTP-based authentication if the first line of the result is "HTTP/1.1 401 Unauthorized".
     

You can figure this out by executing a curl command over the seed URL:


curl -i http://mysitehost/mysite

Check the response for a WWW-Authenticate header. If present, it specifies which authentication mechanism is required and, when applicable, its realm.
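As a quick way to isolate that header, the response can be filtered with grep. The sample headers below are hypothetical, shown only to illustrate what the output of the real curl command might look like:

```shell
# Against a live site you would run (seed URL is a placeholder):
#   curl -sI http://mysitehost/mysite | grep -i '^www-authenticate'
#
# Sample 401 response headers (hypothetical) to illustrate the filtering:
headers='HTTP/1.1 401 Unauthorized
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
Content-Length: 0'

printf '%s\n' "$headers" | grep -i '^www-authenticate'
# → WWW-Authenticate: Negotiate
# → WWW-Authenticate: NTLM
```

Each matching line names one authentication scheme the server accepts.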

Aspider supports the following mechanisms:

  • Basic
  • Digest
  • NTLM
  • Negotiate/Kerberos


Some sites return two or more "WWW-Authenticate" headers in their response. The first corresponds to the preferred scheme, and the others are fallbacks for browsers that don't support the preferred one. With Aspider, you can use any of the mechanisms as long as it is a supported scheme (listed above). Check with the appropriate IT department to confirm which scheme must be used.

Some schemes require a realm. If you see a realm in the response headers, use that value in the configuration.

Basic/Digest/NTLM

Try your credentials with curl:

curl -I --<basic/digest/ntlm> -u <username> <the-seed-url>

The command will prompt you for the password. If the response headers look like the following, your credentials are correct.

HTTP/1.1 200 OK
Content-Length: <XXX>
Content-Type: text/html
Last-Modified: <XXXXXXXXXXXXXXXXX>

Negotiate/Kerberos

If you want to use the Negotiate/Kerberos authentication scheme, then you need to find the "Key Distribution Center" (KDC). This is a service that supplies session tickets and temporary session keys to users and computers within an Active Directory domain. If you don't know your KDC address, do as follows:

  • On Windows CMD or Powershell
nltest /dsgetdc:<domain.name>

The KDC address will appear in the first line as "DC: <the KDC address>"

  • On Linux bash
cat /etc/krb5.conf

Look for something like:

[realms]
TESTDOM.LAN = {
    kdc = DC1.TESTDOM.LAN
    admin_server = DC1.TESTDOM.LAN
}
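To pull the KDC address out of krb5.conf directly, an awk one-liner can help. This is a sketch that assumes the [realms] layout shown above; it writes a sample file so the command can be tried without a real config:

```shell
# Create a sample krb5.conf mirroring the structure shown above:
cat > /tmp/krb5.conf.sample <<'EOF'
[realms]
TESTDOM.LAN = {
    kdc = DC1.TESTDOM.LAN
    admin_server = DC1.TESTDOM.LAN
}
EOF

# Print every "kdc = ..." value (on a real host, point this at /etc/krb5.conf):
awk -F'= *' '/^[[:space:]]*kdc[[:space:]]*=/ {print $2}' /tmp/krb5.conf.sample
# → DC1.TESTDOM.LAN
```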

If your site requires this kind of authentication, then you need to know certain details about it in order to configure Aspider to crawl it.

  1. The login form page is where the login form is located. Some sites redirect you there if you are not authenticated; in that case, if you execute the curl command from above, the login page will most likely be the URL specified in the "Location" response header.

  2. The login form HTML structure is the CSS path to the HTML form that Aspider needs to fill in and submit in order to authenticate.
    1. For example, if your login form page consists of the following HTML, your path would be "div > form". You can use Google Chrome's inspect mode to generate this CSS selector for you.

      <html>
      <head>....</head>
      <body>
          <div ....>
          <form method="post" action="...." ...
             ...
          </form>
          </div>
      </body>
      </html>
    2. Also identify the name attribute of the user and password input elements.
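The form's action URL and the input names can be spotted quickly with grep. The HTML below is a hypothetical login page (the action path and input names are assumptions, not part of any real site):

```shell
# Sample login page (hypothetical):
cat > /tmp/login.html <<'EOF'
<html>
<body>
  <div>
    <form method="post" action="/do-login">
      <input type="text" name="username"/>
      <input type="password" name="password"/>
    </form>
  </div>
</body>
</html>
EOF

# Where the form posts to:
grep -o 'action="[^"]*"' /tmp/login.html
# → action="/do-login"

# The name attributes of the input elements:
grep -o 'name="[^"]*"' /tmp/login.html
# → name="username"
# → name="password"
```

For real pages, the browser's inspect mode remains the more reliable way to read these attributes, since the markup may be generated dynamically.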


Why does an incremental crawl last as long as a full crawl?

Some connectors perform incremental crawls based on snapshot entries, which record exactly which documents the connector has indexed into the search engine. During an incremental crawl, the connector crawls the repository fully, the same way as a full crawl, but it only indexes the documents that were modified, added, or deleted since the previous crawl.

For a discussion on crawling, see Full & Incremental Crawls.

Authentication Issues

(Cookie Based) I have added the "username" and "password" fields, but I still can't authenticate.

Sometimes the username and password fields alone are not enough to authenticate to a site. Some sites require custom fields, or even the "submit" button, in order to authenticate you successfully, so you may have to add them as custom fields in the Aspider configuration.

You can also open the browser inspect mode in order to break down the authentication request and make sure you are not missing any field.

(Cookie Based) How to include those dynamic "hidden" elements into the Aspider authentication request?

Don't worry about hidden fields inside the form; Aspider automatically includes them in the request, so you don't have to do anything.

(Cookie Based) My initial login is successful but shortly after, Aspider can't connect to already discovered URLs.

Watch out for "logout" pages, which usually instruct the browser to clear its cookies. If Aspider is asked to clear its cookies to log out, it will do so and will not try to log in again.

Suggestion: Add an exclusion pattern for the "log out" pages.

  • If "Scan Excluded Items" is selected, make sure the "Do not follow patterns" also contains a pattern that matches the "log out" page.
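For example, if the site's logout link lives under a path containing "logout" (a hypothetical path; check your own site), an exclusion pattern along these lines could work, depending on which pattern syntax your configuration uses:

```
*logout*        (wildcard style)
.*logout.*      (regex style)
```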

(Cookie Based) My site requires two different HTTP requests in order to authenticate, can Aspider handle that?

Unfortunately, not at the moment. Aspider's cookie-based authentication is built to send only one request, but improvements are already under consideration.

(Cookie Based) Can Aspider handle JavaScript based authentications?

If your login form relies on JavaScript to send the authentication request, the form element probably won't have an "action" attribute, which Aspider uses as the target of its POST request. In that case, you won't be able to authenticate.

(Cookie Based) Which versions of SSL are supported by the Aspider web crawler?

As of Java 8, Aspider supports and has been tested with the following protocols:

  • TLSv1
  • TLSv1.1
  • TLSv1.2

Note: SSLv2 and SSLv3 are not supported by Aspider.


Throttling Considerations


The Aspider Web Crawler connector does not have any major considerations regarding throttling.
