Heritrix Prerequisites

Before you crawl a web site, make sure the following points are covered:

Is the web site available from the Aspire machine?
You can use a tool such as curl, or any web browser, to confirm that the Aspire machine has access to the web sites you want to crawl.
Does the web site have any access restrictions?
If it does, what kind of authentication does it use: NTLM, cookie-based (HTML forms), Basic, or Digest?
Also make sure you have the correct credentials to access it.
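
The two checks above can also be scripted. The following is a minimal sketch, not part of Aspire or Heritrix: the function names `classify_auth` and `probe` are hypothetical, and the form detection is only a rough heuristic (it looks for a password input in the returned HTML). Run it on the Aspire machine so the result reflects that machine's network access.

```python
import urllib.error
import urllib.request

# Schemes announced through the WWW-Authenticate header on a 401 response.
KNOWN_SCHEMES = ("ntlm", "negotiate", "basic", "digest")


def classify_auth(status, challenges, body=""):
    """Guess the authentication type from a response (hypothetical helper).

    status     -- HTTP status code
    challenges -- list of WWW-Authenticate header values
    body       -- start of the response body, used for the form heuristic
    """
    if status == 401:
        schemes = []
        for value in challenges:
            # The scheme name is the first token of each header value.
            token = value.strip().split(" ", 1)[0].lower()
            if token in KNOWN_SCHEMES:
                schemes.append(token)
        return schemes or ["unknown"]
    # Heuristic only: a page containing a password field is probably a
    # cookie-based (HTML form) login page.
    if status == 200 and 'type="password"' in body.lower():
        return ["form"]
    return []


def probe(url, timeout=10):
    """Fetch url and return (status, guessed authentication types).

    Raises urllib.error.URLError if the site cannot be reached at all,
    which answers the availability question.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            challenges = response.headers.get_all("WWW-Authenticate") or []
            body = response.read(65536).decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # 401, 403 and similar arrive here; the headers are still available.
        status = error.code
        challenges = error.headers.get_all("WWW-Authenticate") or []
        body = ""
    return status, classify_auth(status, challenges, body)
```

Calling `probe` with the start URL of the site you plan to crawl returns the status code and a list such as `["basic"]`, `["negotiate", "ntlm"]`, or `["form"]`. An empty list suggests the page is not protected, but confirm this in a browser, since the form heuristic can miss login pages.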
