This page describes the Aspire OCR Solution, a ready-to-use solution that uses Aspire to run OCR on data.

Aspire's OCR solution uses Tesseract to apply Optical Character Recognition (OCR) to content acquired by other connectors. It is designed to run as a background process, using the Staging Repository.

When you're using the Staging Repository (but not the OCR solution), content is gathered by any connector and published to the Staging Repository. Then, you crawl the Staging Repository and publish the content to your chosen search engine.

The OCR solution crawls the Staging Repository, processes the content using Tesseract, and then publishes the content (which now includes the extracted text) back into the Staging Repository under a different owner, so the original is still available if required. Once the content is back in the Staging Repository, your next crawl of the Staging Repository can publish the text content to the search engine.
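
The flow described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not Aspire's actual API: `StagedItem`, `StagingRepository`, `run_ocr_pass` and the owner names are all hypothetical, and the OCR step is passed in as a plain function so that a real deployment could call Tesseract at that point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class StagedItem:
    """One piece of staged content (hypothetical shape, not Aspire's schema)."""
    item_id: str
    owner: str
    content: bytes
    text: Optional[str] = None  # filled in only after OCR


class StagingRepository:
    """Hypothetical in-memory stand-in for the Staging Repository."""

    def __init__(self) -> None:
        # Keyed by (owner, item_id), so the same item can exist under two owners.
        self._items: Dict[Tuple[str, str], StagedItem] = {}

    def publish(self, item: StagedItem) -> None:
        self._items[(item.owner, item.item_id)] = item

    def crawl(self, owner: str) -> List[StagedItem]:
        return [i for i in self._items.values() if i.owner == owner]


def run_ocr_pass(
    repo: StagingRepository,
    source_owner: str,
    ocr_owner: str,
    ocr: Callable[[bytes], str],
) -> int:
    """Crawl items staged by source_owner, OCR them, and publish the results
    back under ocr_owner. The originals are left untouched."""
    processed = 0
    for item in repo.crawl(source_owner):
        repo.publish(
            StagedItem(item.item_id, ocr_owner, item.content, ocr(item.content))
        )
        processed += 1
    return processed
```

Because the OCR output is stored under a separate owner, a later crawl can pick up only the OCR-owned items for the search engine, while the connector-owned originals remain available.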
