The Tesseract OCR component can be configured using the Aspire Admin UI. It requires the following entities to be created:

  • Connector
  • Seed

This component is a workflow application, and it is used in the "onAddUpdate" workflow event.

Create Workflow


  1. In the Aspire Admin UI, go to the Workflows main page.
  2. All existing workflows are listed. Select one, or click the new button to create one:
    1. Enter the new workflow description. 
    2. Select the "Create" button.
  3. Go to the Workflow Event "onAddUpdate".
  4. In the Applications options, search for the Tesseract OCR component using “Type criteria”, then use the drag symbol next to the component to drag it into the onAddUpdate section.
  5. Enter a new description for this application component.
  6. OCR Settings:
    1. Tesseract Path: Path to the Tesseract binary file.
    2. Tesseract timeout: Select the time (in milliseconds) to wait before killing an OCR process.
    3. Temporary store directory: Select the directory used to store the temporary files generated during the OCR process.
    4. Image Size-limit for Image Correction: Apply image correction only to images smaller than this size (e.g., 250kb, 5mb, 1gb).
    5. Confidence threshold: Select the minimum confidence value to accept the OCR output.
    6. Image creation settings: Configure the image creation process with the following settings.
      1. Image format: Select the image format for the temporary files generated during the OCR process.
      2. Image color scale: Select the color format for the temporary files generated during the OCR process.
      3. Image DPI: Select the image DPI for the temporary files generated during the OCR process.
  7. Mime type Settings:
    1. Mime Type XPath: Specify the XPath expression to get the document Mime type.
    2. PDF Mime Types: Specify the Mime type for PDF documents that will be accepted for the OCR process.
    3. Image Mime Types: Specify the Mime type for Image documents that will be accepted for the OCR process.
  8. Page splitter settings:
    1. Start page: Select the page at which OCR processing starts. If the value is 0, processing starts from the first page.
    2. End page: Select the last page to process with OCR.
  9. Advanced settings:
    1. Process threads: Set the max number of threads used by the application.
    2. Process Queue size: Set the size of the application's process queue. It should be at least 3 times the number of process threads.
    3. Queue back off time: Set the time (in milliseconds) to wait before trying to add a job to the queue when it is full.
    4. Debug: Enable this option to turn on debug messages.
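
The way the advanced settings relate to each other can be illustrated with a short sketch. The Python code below is not the component's implementation. It is a minimal model, under stated assumptions, of a bounded job queue served by a fixed number of worker threads, where a producer waits for the back off time whenever the queue is full. The names PROCESS_THREADS, QUEUE_SIZE, BACKOFF_MS, submit, worker and run are hypothetical, and the "OCR" step is a placeholder.

```python
import queue
import threading
import time

# Hypothetical values standing in for the Advanced settings.
PROCESS_THREADS = 2                  # "Process threads"
QUEUE_SIZE = 3 * PROCESS_THREADS     # "Process Queue size": at least 3x the threads
BACKOFF_MS = 50                      # "Queue back off time", in milliseconds


def submit(jobs, job, backoff_ms=BACKOFF_MS):
    """Add a job to the bounded queue, sleeping for the back off time while it is full."""
    while True:
        try:
            jobs.put_nowait(job)
            return
        except queue.Full:
            time.sleep(backoff_ms / 1000.0)


def worker(jobs, results):
    """Take jobs until a None sentinel arrives. The 'OCR' step is only a placeholder."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.append(f"processed {job}")  # list.append is thread-safe in CPython


def run(documents):
    """Process all documents with PROCESS_THREADS workers and return the results."""
    jobs = queue.Queue(maxsize=QUEUE_SIZE)
    results = []
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(PROCESS_THREADS)
    ]
    for t in threads:
        t.start()
    for doc in documents:
        submit(jobs, doc)
    for _ in threads:
        submit(jobs, None)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(len(run([f"doc{i}" for i in range(20)])), "documents processed")
```

In this model, a queue several times larger than the thread count keeps the workers supplied with jobs while the producer is sleeping, which is one plausible reason for the "at least 3 times" guidance. A shorter back off time makes the producer retry sooner, at the cost of more frequent wake-ups.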