If you cannot find what you need on this page, please check the connectors general troubleshooting page.
You can enable the crawl logs by checking the "Log crawled URLs" option in the Connection section.
The crawl logs look like this:
INFO [/<connectorId>-aspider/RAP] 200 890 http://localhost:8000/site1/page1.html seed text/html
INFO [/<connectorId>-aspider/RAP] 200 181 http://localhost:8000/site1/pages1/pageC.html http://localhost:8000/site1/page1.html text/html
Each entry is logged in the following format:
[HTTP Status Code] [ContentSize] [URL] [parent URL] [Content Type]
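When a crawl produces many entries, it can be easier to parse them than to read them by eye. The following is a minimal Python sketch, not part of Aspider, that splits one log line into the fields listed above. It assumes the layout of the sample lines shown earlier (log level, bracketed component name, then the five fields) and that the parent URL of a seed is logged as the literal word seed. The names CrawlLogEntry and parse_crawl_log_line are illustrative.

```python
import re
from typing import NamedTuple, Optional


class CrawlLogEntry(NamedTuple):
    status: int
    size: int
    url: str
    parent: Optional[str]  # None when the URL is a seed
    content_type: str


# Assumed layout: level, [component], status, size, URL, parent URL, content type.
LINE_RE = re.compile(
    r"^(?P<level>\S+)\s+\[(?P<component>[^\]]+)\]\s+"
    r"(?P<status>\d{3})\s+(?P<size>\d+)\s+(?P<url>\S+)\s+"
    r"(?P<parent>\S+)\s+(?P<ctype>.+)$"
)


def parse_crawl_log_line(line: str) -> Optional[CrawlLogEntry]:
    """Return the parsed entry, or None if the line does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    parent = match.group("parent")
    return CrawlLogEntry(
        status=int(match.group("status")),
        size=int(match.group("size")),
        url=match.group("url"),
        parent=None if parent == "seed" else parent,
        content_type=match.group("ctype"),
    )
```

A typical use is to filter a log file for entries whose status is not 200, which quickly shows the pages the crawler could not fetch and the parent page that linked to them.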
There are two types of authentication: cookie-based authentication and HTTP-based authentication.
To find out which type your site uses, run a curl command against the seed URL:
curl -i http://mysitehost/mysite
Check the response for a WWW-Authenticate header. If one is present, it specifies which authentication mechanism the site requires and, where applicable, the realm.
Aspider supports the following mechanisms:
Some sites return two or more "WWW-Authenticate" headers in their response. The first one corresponds to the preferred scheme, and the others may serve as fallbacks for browsers that don't support the preferred scheme. For Aspider, you can use any of the offered mechanisms, provided it is a supported scheme (mentioned above). Check with the appropriate IT department to confirm which scheme must be used.
Some schemes require a realm to work. If you see a realm in the response headers, use it in the configuration.
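If the curl output is long, the offered schemes and realms can be pulled out with a few lines of Python. This sketch is only an illustration: it reads the raw response headers as text (for example, the output of curl -i saved to a file), assumes one challenge per WWW-Authenticate header line, and uses made-up sample header values. The name parse_challenges is hypothetical.

```python
import re
from typing import List, Optional, Tuple

REALM_RE = re.compile(r'realm\s*=\s*"([^"]*)"', re.IGNORECASE)


def parse_challenges(raw_headers: str) -> List[Tuple[str, Optional[str]]]:
    """Return (scheme, realm) pairs in the order the server sent them."""
    challenges = []
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "www-authenticate":
            continue  # not a WWW-Authenticate header line
        value = value.strip()
        if not value:
            continue
        scheme = value.split(None, 1)[0]  # first token is the scheme name
        realm = REALM_RE.search(value)
        challenges.append((scheme, realm.group(1) if realm else None))
    return challenges


# Made-up response headers, for illustration only.
SAMPLE_HEADERS = (
    "HTTP/1.1 401 Unauthorized\r\n"
    "WWW-Authenticate: Negotiate\r\n"
    'WWW-Authenticate: Basic realm="intranet"\r\n'
    "Content-Length: 0\r\n"
)
```

Because the list keeps the order of the headers, the first pair is the scheme the server prefers and the remaining pairs are the alternatives.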
Try your credentials with curl:
curl -I --<basic/digest/ntlm> -u <username> <the-seed-url>
The command prompts you for the password. If the response headers look like the following, your credentials are correct.
HTTP/1.1 200 OK
Content-Length: <XXX>
Content-Type: text/html
Last-Modified: <XXXXXXXXXXXXXXXXX>
If you want to use the Negotiate/Kerberos authentication scheme, you need to find the "Key Distribution Center" (KDC). This is a service that supplies session tickets and temporary session keys to users and computers within an Active Directory domain. If you don't know your KDC address, you can look it up as follows. On Windows, run:
nltest /dsgetdc:<domain.name>
The KDC address appears in the first line of the output as "DC: <the KDC address>".
On Linux, check the Kerberos configuration file:
cat /etc/krb5.conf
Look for something like:
[realms]
TESTDOM.LAN = {
    kdc = DC1.TESTDOM.LAN
    admin_server = DC1.TESTDOM.LAN
}
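On hosts where krb5.conf defines several realms, the KDC entries can also be listed programmatically. The following Python sketch is an illustration only. It assumes the common layout in which each realm opens with "REALM = {" on its own line and closes with "}" on its own line, and it ignores include directives. The function name find_kdcs is hypothetical.

```python
import re
from typing import Dict, List


def find_kdcs(krb5_conf_text: str) -> Dict[str, List[str]]:
    """Map each realm in the [realms] section to its list of kdc values."""
    kdcs: Dict[str, List[str]] = {}
    section = None
    realm = None
    for raw in krb5_conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            section, realm = header.group(1), None
            continue
        if section != "realms":
            continue
        opened = re.fullmatch(r"(\S+)\s*=\s*\{", line)
        if opened:
            realm = opened.group(1)
            kdcs.setdefault(realm, [])
        elif line == "}":
            realm = None
        elif realm is not None:
            key, _, value = line.partition("=")
            if key.strip() == "kdc":
                kdcs[realm].append(value.strip())
    return kdcs


# Sample content mirroring the example above.
SAMPLE_CONF = """[libdefaults]
default_realm = TESTDOM.LAN

[realms]
TESTDOM.LAN = {
    kdc = DC1.TESTDOM.LAN
    admin_server = DC1.TESTDOM.LAN
}
"""
```

To use it on a real host, pass the contents of /etc/krb5.conf to find_kdcs and look up the realm of your domain in the result.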
If your site requires cookie-based authentication, you need to know certain details about its login form in order to configure Aspider to crawl it.
The login form page is the page where the login form is located. Some sites redirect you to it if you are not authenticated. In that case, if you execute the curl command from above, the login form page will most likely be the URL specified in the "Location" response header.
For example, if your login form page consists of the following HTML, the path to the form would be "div > form". You can use the inspect mode in Google Chrome to generate this CSS selector for you.
<html>
  <head>....</head>
  <body>
    <div ....>
      <form method="post" action="...." ...>
        ...
      </form>
    </div>
  </body>
</html>
Also identify the name attribute of the user and password input elements.
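To double-check these details without a browser, the login page can be saved to disk and inspected with a short script. The following is a minimal sketch using only the Python standard library. It looks at the first form on the page rather than evaluating a CSS selector, and the class name, function name, and sample HTML are all illustrative assumptions.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LoginFormInspector(HTMLParser):
    """Collect the action, method and input fields of the first <form>."""

    def __init__(self) -> None:
        super().__init__()
        self.action: Optional[str] = None
        self.method: Optional[str] = None
        self.inputs: List[Dict[str, Optional[str]]] = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form" and not self._done and not self._in_form:
            self._in_form = True
            self.action = attr.get("action")  # None if the form has no action
            self.method = attr.get("method")
        elif tag == "input" and self._in_form:
            self.inputs.append(
                {"name": attr.get("name"), "type": attr.get("type", "text")}
            )

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


def inspect_login_form(html: str) -> LoginFormInspector:
    inspector = LoginFormInspector()
    inspector.feed(html)
    inspector.close()
    return inspector


# Made-up login page, for illustration only.
SAMPLE_PAGE = """
<html><body><div class="login">
  <form method="post" action="/login">
    <input type="hidden" name="csrf" value="abc123">
    <input type="text" name="user">
    <input type="password" name="pass">
    <input type="submit" name="submit" value="Log in">
  </form>
</div></body></html>
"""
```

For the sample page, the inspector reports the action "/login" and the input names csrf, user, pass and submit. An action of None is worth noting, since it suggests the form is submitted by JavaScript rather than by a plain POST.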
Some connectors perform incremental crawls based on snapshot entries, which are meant to match the exact documents that have been indexed by the connector to the search engine. On an incremental crawl, the connector fully crawls the repository the same way as a full crawl, but it only indexes the modified, new or deleted documents during that crawl.
For a discussion on crawling, see Full & Incremental Crawls.
Sometimes the username and password fields alone are not enough to authenticate to a site. Some sites also require custom fields, or even the "submit" button, in order to authenticate you successfully. In that case, add them as custom fields in the Aspider configuration.
You can also use the browser's inspect mode to break down the authentication request and make sure you are not missing any field.
Don't worry about the hidden fields inside the form. Aspider automatically includes them in the request, so you don't have to do anything.
Watch out for "logout" pages, which usually ask the browser to clear its cookies. If Aspider is asked to clear its cookies as part of a logout, it does so and does not try to log in again.
Suggestion: Add an exclusion pattern for the "log out" pages.
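Before adding such a pattern, it can be tried against a handful of URLs from the site. The sketch below assumes that exclusion patterns are regular expressions matched against the full URL; confirm this against the connector's configuration reference. The pattern itself is a hypothetical example, not an Aspider default, and it is tested here with Python's regex engine, which agrees with Java's for this simple syntax.

```python
import re

# Hypothetical pattern: a path segment such as logout, log-out, log_out,
# signout or signoff, optionally followed by an extension, query or slash.
LOGOUT_PATTERN = re.compile(
    r".*/(log|sign)[-_]?(out|off)([/?#.].*)?$", re.IGNORECASE
)


def is_logout_url(url: str) -> bool:
    """Return True if the URL would be excluded by the sample pattern."""
    return LOGOUT_PATTERN.match(url) is not None
```

Checking a few known URLs first, including ordinary pages that merely contain similar text, helps avoid excluding more of the site than intended.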
Unfortunately, not at the moment. Aspider cookie-based authentication is built to send only one request, but we are already considering improvements for it.
If your login form relies on JavaScript to send the authentication request, the form element probably won't have an "action" attribute, which Aspider uses as the target of the POST request. In that case, Aspider will not be able to authenticate.
As of Java 8, Aspider supports and has been tested to work on the following protocols:
Note: SSLv2 and SSLv3 are not supported by Aspider.
The Aspider Web Crawler connector does not have any major considerations regarding throttling.