Aspire's OCR solution uses Tesseract to run Optical Character Recognition (OCR) on content acquired by other connectors. It is designed to run as a background process that works through the Staging Repository.

When you use the Staging Repository without the OCR solution, any connector gathers content and publishes it to the Staging Repository. You then crawl the Staging Repository and publish the content to your chosen search engine.

The OCR solution crawls the Staging Repository, processes the content using Tesseract, and then publishes the content (which now includes the extracted text) back into the Staging Repository under a different owner, so the original is still available if required. Once the content is back in the Staging Repository, your next crawl of the repository can publish the text content to the search engine.
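
The flow described above can be illustrated with a small sketch. This is not Aspire's implementation or API: `StagedItem`, `StagingRepository`, `run_ocr_pass`, and the owner names `connector` and `ocr` are all hypothetical, and the `ocr` argument is a plain callable standing in for Tesseract. It only shows the idea of crawling one owner's content and republishing the result under a different owner so that the original copy is kept.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StagedItem:
    item_id: str
    owner: str      # which process published this copy (hypothetical field)
    content: bytes  # raw bytes for a gathered copy, UTF-8 text for an OCR copy


class StagingRepository:
    """Toy in-memory stand-in for the Staging Repository (assumed shape)."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], StagedItem] = {}

    def publish(self, item: StagedItem) -> None:
        # Keyed by (id, owner), so copies from different owners coexist.
        self._items[(item.item_id, item.owner)] = item

    def crawl(self, owner: str) -> List[StagedItem]:
        # Return a snapshot list of everything published by the given owner.
        return [i for i in self._items.values() if i.owner == owner]


def run_ocr_pass(
    repo: StagingRepository,
    ocr: Callable[[bytes], str],
    source_owner: str = "connector",
    ocr_owner: str = "ocr",
) -> int:
    """Crawl the source owner's items, OCR them, and republish as ocr_owner."""
    processed = 0
    for item in repo.crawl(source_owner):
        text = ocr(item.content)  # stand-in for the Tesseract step
        repo.publish(StagedItem(item.item_id, ocr_owner, text.encode("utf-8")))
        processed += 1
    return processed
```

In this sketch, a later crawl of the `ocr` owner would pick up the text copies for publishing to the search engine, while the `connector` copies remain untouched.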

Aspire and Hadoop requirements

Configuration Tutorial

Step-by-step tutorial to configure and run your first OCR solution

Administration FAQ

Questions and answers, including troubleshooting techniques for Administrators
