If you want to crawl a web site with Aspider, make sure the following points are covered:

  • The Aspire servers must be able to reach the seed URL(s) (configured in the content source configuration).
    • Try pinging the web site server, or requesting the seed URL using curl.

  • Check for any credentials needed to access the sites to be crawled.


How to know which authentication mechanism to use?

There are two types of authentication: HTTP-based authentication and cookie-based authentication.

  • HTTP-based sites prompt you for credentials in the browser without displaying a login page.
  • Cookie-based sites redirect you to a login page where you log in.


The first thing you should do is run curl against the seed URL:

$ curl -i http://mysitehost/mysite

If the first line of the result is a redirection, for example "HTTP/1.1 302 Found", you are probably facing a site with cookie-based authentication.

If the first line of the result is "HTTP/1.1 401 Unauthorized", you are probably facing a site with HTTP-based authentication.
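
The check above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not part of Aspider: the function name classify_auth is hypothetical, and treating every common redirect code (not only 302) as a sign of cookie-based authentication is an assumption.

```python
def classify_auth(status_line: str) -> str:
    """Guess the authentication type from the first line of a `curl -i` response."""
    parts = status_line.split()
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"not an HTTP status line: {status_line!r}")
    code = int(parts[1])
    if code == 401:
        # The server challenges the client; inspect the WWW-Authenticate headers.
        return "http"
    if code in (301, 302, 303, 307, 308):
        # Assumption: a redirect on the seed URL probably leads to a login form.
        return "cookie"
    return "unknown"


print(classify_auth("HTTP/1.1 302 Found"))
print(classify_auth("HTTP/1.1 401 Unauthorized"))
```

A redirect does not always mean a login page (it may simply point to a canonical URL), so it is still worth following the "Location" header by hand before configuring anything.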

Cookie based authentication

If your site requires this kind of authentication, you need to know certain details about it in order to configure Aspider to crawl it:

  1. The login form page: this is where the site redirects you when you are not authenticated. If you executed the curl command from above, it will most likely be the URL specified in the "Location" response header.
  2. The login form HTML structure: this is the element path to the HTML form that Aspider needs to fill in and submit to authenticate.
    1. For example if your login form page consists of the following HTML:

      <html>
      <head>....</head>
      <body>
          <div ....>
          <form method="post" action="...." ...>
             ...
          </form>
          </div>
      </body>
      </html>

      Your path would be /html/body/div/form

    2. Also identify the id attributes of the user and password input elements.
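
If the login page is large, finding the form path and the input ids by eye is error-prone. The following is a minimal sketch using only the Python standard library; the FormFinder class, the sample markup, the "/login" action and the ids "username" and "password" are all hypothetical and stand in for whatever your real login page contains.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class FormFinder(HTMLParser):
    """Record the element path of each <form> and the ids of its <input> fields."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open elements, outermost first
        self.forms = []        # list of (path, [input ids])
        self._current = None   # id list of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            if self._current is not None:
                self._current.append(dict(attrs).get("id"))
            return
        if tag in VOID:
            return
        self.stack.append(tag)
        if tag == "form":
            self._current = []
            self.forms.append(("/" + "/".join(self.stack), self._current))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


# Hypothetical login page, shaped like the example above.
SAMPLE = """
<html>
<head><title>Login</title></head>
<body>
  <div class="login">
    <form method="post" action="/login">
      <input type="text" id="username" name="user">
      <input type="password" id="password" name="pass">
    </form>
  </div>
</body>
</html>
"""

finder = FormFinder()
finder.feed(SAMPLE)
for path, input_ids in finder.forms:
    print(path, input_ids)
```

To try it on a real site, save the login page with curl and feed the file contents to the parser instead of SAMPLE. Forms that are generated by JavaScript will not appear in the raw HTML, so this sketch will not find them.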

HTTP based authentication

If your site requires this type of authentication you need to determine which authentication scheme to use.

If you executed the curl command from above, you can determine the authentication scheme by looking at the "WWW-Authenticate" headers.

Aspider supports the following schemes:

  • Basic
  • Digest
  • NTLM
  • Negotiate
  • Kerberos

Some sites send two "WWW-Authenticate" headers in their response. The first one corresponds to the preferred scheme and the second one is used as a fallback, because some browsers don't support the preferred authentication scheme. As far as Aspider is concerned, you can use either of the two mechanisms, as long as it is one of the supported schemes listed above.
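
As a rough aid, the schemes a server offers can be pulled out of saved `curl -i` output and compared with the list above. This is a sketch under stated assumptions: the function names and the sample response are hypothetical, and it assumes one challenge per "WWW-Authenticate" header line (a server may also put several comma-separated challenges in a single header, which this does not handle).

```python
# Schemes listed above as supported by Aspider, lowercased for comparison.
SUPPORTED = ("basic", "digest", "ntlm", "negotiate", "kerberos")


def offered_schemes(header_lines):
    """Return the scheme token of each WWW-Authenticate header, in order."""
    schemes = []
    for line in header_lines:
        name, _, value = line.partition(":")
        if name.strip().lower() == "www-authenticate" and value.strip():
            # The scheme is the first token, e.g. 'Basic' in 'Basic realm="x"'.
            schemes.append(value.split()[0])
    return schemes


def usable_schemes(header_lines):
    """Keep only the offered schemes that appear in the supported list."""
    return [s for s in offered_schemes(header_lines) if s.lower() in SUPPORTED]


# Hypothetical response headers, as printed by `curl -i`.
RESPONSE = """HTTP/1.1 401 Unauthorized
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
Content-Length: 0
""".splitlines()

print(usable_schemes(RESPONSE))
```

The order of the result follows the order of the headers, so the first entry is the scheme the server prefers.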
