Information on the Aspire OCR Solution, a ready-to-use solution that uses Aspire to OCR data.

  • Overview

Overview of the solution, as well as the related components

  • Prerequisites

Aspire's OCR solution uses Tesseract to perform Optical Character Recognition (OCR) on content acquired by other connectors. It is designed to run as a background process that uses the Staging Repository.

When you're using the Staging Repository (but not the OCR solution), content is gathered by any connector and published to the Staging Repository. Then, you crawl the Staging Repository and publish the content to your chosen search engine.

The OCR solution crawls the Staging Repository, processes the content using Tesseract, and then publishes the content (which now includes the recognized text) back into the Staging Repository under a different owner, so the original is still available if required. Once the content is back in the Staging Repository, your next crawl of the Staging Repository can publish the text content to the search engine.
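
The flow described above can be sketched as a small in-memory model. This is only an illustration of the idea, not Aspire's API: `StagedItem`, `StagingRepository`, `run_ocr_pass`, `fake_ocr`, and the owner names are all hypothetical, and the OCR step is a stub standing in for Tesseract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StagedItem:
    """One record in a hypothetical stand-in for the Staging Repository."""
    item_id: str
    owner: str      # who published this copy (a connector or the OCR solution)
    content: bytes  # raw bytes from the connector, or extracted text after OCR


class StagingRepository:
    """Toy store keyed by (owner, item_id), so copies from different owners coexist."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], StagedItem] = {}

    def publish(self, item: StagedItem) -> None:
        self._items[(item.owner, item.item_id)] = item

    def crawl(self, owner: str) -> List[StagedItem]:
        # Return a list so callers can publish while iterating over the result.
        return [i for i in self._items.values() if i.owner == owner]


def run_ocr_pass(
    repo: StagingRepository,
    source_owner: str,
    ocr_owner: str,
    ocr: Callable[[bytes], str],
) -> int:
    """Crawl items from source_owner, OCR them, and republish under ocr_owner."""
    processed = 0
    for item in repo.crawl(source_owner):
        text = ocr(item.content)
        # Publishing under a different owner leaves the original copy untouched.
        repo.publish(StagedItem(item.item_id, ocr_owner, text.encode("utf-8")))
        processed += 1
    return processed


def fake_ocr(image_bytes: bytes) -> str:
    """Placeholder for a Tesseract call; it only reports the input size."""
    return f"recognized text from {len(image_bytes)} bytes"


# A connector publishes an image, then the OCR pass adds a text copy.
repo = StagingRepository()
repo.publish(StagedItem("doc-1", "file-connector", b"\x89PNG..."))
run_ocr_pass(repo, "file-connector", "ocr-solution", fake_ocr)
```

After the pass, crawling with the `ocr-solution` owner returns the text copy that would be sent to the search engine, while crawling with `file-connector` still returns the original bytes.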

Aspire and Hadoop requirements

  • Configuration Tutorial

Step by step tutorial to configure and run your first OCR solution

  • Administration FAQ

Questions and answers, including troubleshooting techniques for Administrators