...

  • This is normally a DNS issue. Check with your network team to see if the site URL(s) you want to crawl are reachable from the server running Aspire.
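
For a quick reachability check from the machine running Aspire, a short Java snippet like the following resolves a crawl host through DNS (the hostname is a placeholder; this is just a diagnostic sketch, not part of the connector):

    import java.net.InetAddress;

    // Resolve a crawl host from the Aspire server; a failure here points to DNS.
    public class DnsCheck {
        public static void main(String[] args) throws Exception {
            System.out.println(InetAddress.getByName("www.example.com").getHostAddress());
        }
    }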

Where can I learn more about Heritrix architecture and configuration?

...

  • The Heritrix Connector is only in charge of:
    • Preparing the crawler-beans.cxml file with the user's parameters (a minimal example appears after this list).
    • Creating a Heritrix Engine job and starting the crawl.
    • Receiving the crawled web pages or documents from the Heritrix Crawl Engine and sending them to an Aspire pipeline.
    • Managing incremental indexing (ignoring unchanged documents, sending new documents, and deleting the ones that are no longer accessible) based on the documents received from the Heritrix Crawl Engine.
    • Cleaning up the content of each document using the Cleanup Regex.
    • Applying the Index include/exclude patterns to the documents received.
  • The Heritrix Crawl Engine is in charge of:
    • Actually performing the crawl.
    • Checking robots policies.
    • Performing authentication (including NTLM in our custom engine) if specified and required.
    • Applying XSLT transformations (in our custom engine).
    • Fetching an input stream for each document.
    • Calculating an MD5 digest of the content, used later by the Aspire Heritrix Connector for incremental indexing.
    • Applying the Crawl patterns specified by the user.
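
To make the first connector task concrete, here is a minimal fragment of the kind of crawler-beans.cxml content the connector prepares. The bean layout follows the stock Heritrix 3 profile (a TextSeedModule fed by a ConfigString); the seed URL is a placeholder:

    <!-- Seed list: the connector fills this in from the user's start URLs -->
    <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
      <property name="textSource">
        <bean class="org.archive.spring.ConfigString">
          <property name="value">
            <value>
    http://example.com/
            </value>
          </property>
        </bean>
      </property>
    </bean>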

...

More information on crawler beans configuration is available at https://webarchive.jira.com/wiki/display/Heritrix/Basic+Crawl+Job+Settings.
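
As a hedged illustration, basic job settings are typically adjusted through the simpleOverrides bean of the stock Heritrix 3 profile; metadata.jobName, metadata.operatorContactUrl, and crawlController.maxToeThreads are properties of the stock beans, but the values below are placeholders:

    <bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
      <property name="properties">
        <value>
    # Java properties format: beanName.property=value
    metadata.jobName=my-crawl
    metadata.operatorContactUrl=http://example.com/contact
    crawlController.maxToeThreads=25
        </value>
      </property>
    </bean>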

Why does an incremental crawl last as long as a full crawl?

The Heritrix Connector performs incremental crawls based on a disk-backed HashMap that maps each document the connector has indexed to the search engine to a content digest signature. On an incremental crawl the connector fully crawls the configured web sites exactly as in a full crawl, but it only indexes the documents that were added, modified, or deleted during that crawl.
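
As a minimal sketch of that decision logic (an ordinary in-memory HashMap stands in for the connector's disk-backed map, and the class and method names are illustrative, not Aspire's actual API):

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative stand-in for the connector's disk-backed URL -> digest map.
    public class IncrementalIndex {
        private final Map<String, String> urlToDigest = new HashMap<>();
        private final Set<String> seenThisCrawl = new HashSet<>();

        // True if the page is new or its MD5 digest changed, i.e. it must be (re)indexed.
        public boolean shouldIndex(String url, byte[] content) throws Exception {
            String digest = new BigInteger(1,
                    MessageDigest.getInstance("MD5").digest(content)).toString(16);
            seenThisCrawl.add(url);
            String previous = urlToDigest.put(url, digest);
            return previous == null || !previous.equals(digest);
        }

        // After the crawl: any previously indexed URL not seen again is gone -> delete it.
        public Set<String> toDelete() {
            Set<String> gone = new HashSet<>(urlToDigest.keySet());
            gone.removeAll(seenThisCrawl);
            return gone;
        }
    }

Note that every page still has to be fetched and digested before the comparison can run, which is why the crawl itself takes roughly as long as a full crawl; only the indexing work is saved.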

...

Heritrix by default extracts a maximum of 6,000 links from a single URL; the rest of the links found are discarded and therefore not crawled. You can change that limit by overriding a bean in a custom Heritrix crawler-beans file (see the example below). See more information on how to configure this at Using a Custom Heritrix Configuration File.
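
For example, assuming the stock Heritrix 3 profile, where the HTML extractor bean is named extractorHtml and the 6,000 default lives in its maxOutlinks property, the limit can be raised by editing that bean in your custom crawler-beans file; the value 10000 is just an illustration:

    <!-- Raise the per-page outlink limit (default 6000) on the HTML extractor -->
    <bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
      <property name="maxOutlinks" value="10000" />
    </bean>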

...

To request new features or for more information, please contact us at http://www.searchtechnologies.com/contacts.html.