How it Works

The Failed Documents Handling feature detects and reprocesses documents that fail during crawling, content processing, or publishing. Failure detection can be configured for exceptions thrown on individual documents or for batch errors raised when publishing to a search engine.

Reprocessing a document

Retrying on the same crawl

If a document or batch fails during a crawl, it is reprocessed at the end of that crawl, after the deletes are sent. The "Maximum retries per crawl" option sets how many times a document is retried within the same crawl.

[Image: MaxRetries.JPG]
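
The same-crawl retry behavior can be pictured with a minimal Python sketch. This is not Aspire code: `crawl_with_retries`, `process`, and `max_retries_per_crawl` are hypothetical names chosen for illustration, and the sketch assumes that any exception counts as a retryable failure.

```python
from typing import Callable, Iterable


def crawl_with_retries(
    documents: Iterable[str],
    process: Callable[[str], None],
    max_retries_per_crawl: int,
) -> list[str]:
    """Process every document once, then retry the failures at the end.

    Returns the ids that still fail after all retries are used up.
    """
    failed = []
    for doc_id in documents:
        try:
            process(doc_id)
        except Exception:
            failed.append(doc_id)

    # In the feature described above, deletes are sent before this point.
    for _ in range(max_retries_per_crawl):
        if not failed:
            break
        still_failing = []
        for doc_id in failed:
            try:
                process(doc_id)
            except Exception:
                still_failing.append(doc_id)
        failed = still_failing
    return failed
```

Under these assumptions, a document that fails on its first attempt and on its first retry is recovered only if "Maximum retries per crawl" is at least 2.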

Retrying on the next incremental crawl

When running an incremental crawl, the connector first checks for any remaining failed documents and retries them before starting the crawl. Any document or batch that fails during this check or during the incremental crawl itself is retried at the end of the crawl, after the deletes are sent. The "Maximum crawls to retry" option sets the number of incremental crawls over which a document is reprocessed.

[Image: MaxCrawls.JPG]

Documents are reprocessed individually, not in batches. If a batch fails because of an error in a single document, reprocessing can therefore complete the other documents in that batch and keep retrying only the ones that still fail.
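
As a rough illustration of why individual reprocessing isolates the failing document, here is a short Python sketch. The names `reprocess_individually` and `publish_one` are hypothetical and do not come from Aspire.

```python
from typing import Callable


def reprocess_individually(
    failed_batch: list[str],
    publish_one: Callable[[str], None],
) -> tuple[list[str], list[str]]:
    """Retry each document of a failed batch on its own.

    Returns a pair (recovered, still_failing).
    """
    recovered, still_failing = [], []
    for doc_id in failed_batch:
        try:
            publish_one(doc_id)
            recovered.append(doc_id)
        except Exception:
            # Only this document stays in the retry queue.
            still_failing.append(doc_id)
    return recovered, still_failing
```

With a batch of three documents in which only one is malformed, this sketch recovers two and leaves a single document to be retried.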

Failure Detection

To detect document or batch failures, the component takes a list of regex patterns and matches them against the error messages thrown by documents or batches. If any pattern matches an error message, the document is marked as failed and retried at a later stage. If a pattern matches a batch error, all documents in that batch are marked as failed. These patterns are set up in the "Exception Patterns to retry" section.

[Image: Patterns.JPG]
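
The pattern check can be sketched in Python as follows. The function names and the example patterns are hypothetical, and the sketch assumes a pattern may match anywhere in the error message (`re.search`); the actual component may anchor its matches differently.

```python
import re


def is_retryable(error_message: str, patterns: list[str]) -> bool:
    """Return True if any configured pattern matches the error message."""
    return any(re.search(pattern, error_message) for pattern in patterns)


def failed_documents_for_batch(
    batch: list[str], batch_error: str, patterns: list[str]
) -> set[str]:
    """A matching batch error marks every document in the batch as failed."""
    return set(batch) if is_retryable(batch_error, patterns) else set()


# Example patterns, invented for illustration only.
example_patterns = [r"Connection timed out", r"HTTP 5\d\d"]
```

For instance, under these example patterns a batch error containing "HTTP 503" would mark the whole batch as failed, while an unrelated mapping error would not.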

Other Options

Retry all publishing errors

When this option is enabled, any document that throws an error during the publishing stage is reprocessed on the next crawl.

[Image: RetryAll.JPG]

Remove Failed Documents from Snapshot

When this option is enabled, any failed document is removed from the snapshot at the end of the crawl. As a result, the document is retried on every subsequent incremental crawl, regardless of the "Maximum crawls to retry" option.

[Image: RemoveSnap.JPG]
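
The interaction between "Remove Failed Documents from Snapshot" and "Maximum crawls to retry" might be summarized with the sketch below. The function and parameter names are hypothetical, and the exact boundary (whether the limit is inclusive) is an assumption, not something the description above states.

```python
def retried_on_next_crawl(
    crawls_already_retried: int,
    max_crawls_to_retry: int,
    remove_failed_from_snapshot: bool,
) -> bool:
    """Decide whether a failed document is retried on the next incremental crawl."""
    if remove_failed_from_snapshot:
        # The crawl limit is ignored when the document is dropped from the snapshot.
        return True
    # Assumption: retries stop once the configured number of crawls is reached.
    return crawls_already_retried < max_crawls_to_retry
```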