The Tesseract OCR component can be configured using the Aspire Admin UI. It requires the following entities to be created:

  • Connector
  • Seed

This component is an application for workflow configuration, and it is used in the "onAddUpdate" Workflow Event.

Create Workflow


  1. On the Aspire Admin UI, go to the Workflows main page.
  2. All existing workflows will be listed. Select one or click the new button.
    1. Enter the new workflow description. 
    2. Select the "Create" button.
  3. Go to the Workflow Event "onAddUpdate".
  4. Search in "Type criteria" (in the Application's options) and drag the symbol next to the Tesseract OCR component into the onAddUpdate section.
  5. Enter a new description for this application component.
  6. Elasticsearch Settings:
  7. Server URL: The URL of the Elasticsearch server to use.
  8. Authentication (Basic): Select this to enable basic user authentication.
    1. Username: The name of the Elasticsearch user to use.
    2. Password: The password of the Elasticsearch user to use.
  9. Amazon Web Services (AWS): Select this to enable AWS Signature V4 authentication.
    1. Region: The region of the ES service to use, e.g., us-east-1.
    2. Use credentials provider chain: Check this to use the default AWS credentials provider chain.
    3. Access Key: The Access key of the ES service to use. 
    4. Secret Key: The Secret key of the ES service to use.
  10. Index: Name of the index from which to get the _source content.
  11. Connection Settings:
    1. Connection pool
      1. Idle connection timeout: Time (in milliseconds) to keep an idle connection open.
      2. Max connections: Maximum number of connections to be opened.
      3. Connections per target: Number of connections opened for the same target.
    2. Timeout settings
      1. Connection timeout: Time (in milliseconds) to wait for the connection.
      2. Socket timeout: Time (in milliseconds) to wait for a socket response.
    3. Connection throttling:
      1. Throttling settings
        1. Throttling period: Time period (in milliseconds) to throttle the connection.
        2. Max connections per period: Number of connections used during the throttling period.
    4. Retries:
      1. Maximum retries: Maximum number of retries for a failed document.
      2. Retry delay: Time (in milliseconds) to wait before a retry.
  12. Cache:
    1. Use cache: Select this to cache results in memory.
    2. Cache Eviction Policy:
      1. Size
        1. Max number of entries: Max total number of entries to keep in the cache.
      2. Weight
        1. Max total Weight (MB): Specifies the maximum total weight of entries the cache may contain.
      3. Time
        1. Time (min): Remove records that have been idle for the given number of minutes.
    OCR Settings:
    1. Tesseract Path: Path to the Tesseract binary file.
    2. Tesseract timeout: Select the time (in milliseconds) to wait before killing an OCR process.
    3. Temporary store directory: Select the directory used to store the temporary files generated during the OCR process.
    4. Image Size-limit for Image Correction: Apply image correction only to images that fall under this size (e.g., 250kb, 5mb, 1gb).
    5. Confidence threshold: Select the minimum confidence value required to accept the OCR output.
    6. Image creation settings: Configure the image creation process with the following settings.
      1. Image format: Select the image format for the temporary files generated during the OCR process.
      2. Image color scale: Select the color format for the temporary files generated during the OCR process.
      3. Image DPI: Select the image DPI for the temporary files generated during the OCR process.
  13. Mime type Settings:
    1. Mime Type XPath: Specify the XPath expression used to get the document MIME type.
    2. PDF Mime Types: Specify the MIME types of PDF documents that will be accepted for the OCR process.
    3. Image Mime Types: Specify the MIME types of image documents that will be accepted for the OCR process.
  14. Page splitter settings:
    1. Start page: Select the page at which to start processing with OCR. If the value is 0, processing starts from the first page.
    2. End page: Select the last page to process with OCR.
  15. Advanced settings:
    1. Process threads: Set the max number of threads used by the application.
    2. Process Queue size: Set the size of the application process queue. It should be at least 3 times the number of process threads.
    3. Queue back off time: Set the time (in milliseconds) to wait before trying again to add a job to the queue when it is full.
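
The relationship between the process threads, the queue size, and the back off time can be illustrated with a minimal Python sketch. This is not Aspire's implementation; the function names and the `max_attempts` parameter are hypothetical and exist only to show the sizing rule and the wait-then-retry behavior described above.

```python
import queue
import time


def recommended_queue_size(process_threads: int) -> int:
    """Smallest queue size that satisfies the 'at least 3 times the threads' guidance."""
    if process_threads < 1:
        raise ValueError("process_threads must be at least 1")
    return 3 * process_threads


def submit_with_backoff(jobs: queue.Queue, job, backoff_ms: int,
                        max_attempts: int = 5) -> bool:
    """Try to enqueue a job; when the queue is full, wait backoff_ms and try again.

    max_attempts is a hypothetical safety limit so that the sketch always terminates.
    """
    for _ in range(max_attempts):
        try:
            jobs.put_nowait(job)
            return True
        except queue.Full:
            # Queue is full: back off before the next attempt.
            time.sleep(backoff_ms / 1000.0)
    return False
```

For example, with 4 process threads the sketch suggests a queue of at least 12 entries; a larger queue absorbs bursts at the cost of more memory.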
    Lookup Fields:
    1. Index lookup field: Elastic index field name for the lookup.
    2. Source lookup field: Field name from the incoming AspireObject for the lookup. The field is searched first in 'doc' and then in the 'doc.connectorSpecific' section.
    3. Uppercase the source lookup field value: Convert the value of the source field to uppercase.
    4. Lookup output field: Output fields from the lookup will be placed under this configured object.
    5. Debug: Select this to enable debug messages.
    6. Hit Size: Maximum number of hits returned by the cache lookup. If -1, all hits will be returned.
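
The lookup field behavior described above can be sketched in Python. The sketch assumes the incoming document is modeled as a plain dictionary with an optional nested `connectorSpecific` dictionary; that model and both function names are hypothetical and do not reflect the actual AspireObject API.

```python
from typing import Any, List, Optional


def resolve_source_value(doc: dict, field: str, uppercase: bool = False) -> Optional[str]:
    """Look for the field in 'doc' first, then in 'doc.connectorSpecific'."""
    value: Any = doc.get(field)
    if value is None:
        value = doc.get("connectorSpecific", {}).get(field)
    if value is None:
        return None
    text = str(value)
    # Mirrors the "Uppercase the source lookup field value" option.
    return text.upper() if uppercase else text


def limit_hits(hits: List[Any], hit_size: int) -> List[Any]:
    """Apply the Hit Size setting: -1 returns every hit, otherwise at most hit_size."""
    if hit_size < -1:
        raise ValueError("hit_size must be -1 or a non-negative number")
    return list(hits) if hit_size == -1 else list(hits[:hit_size])
```

A document such as `{"id": "a1", "connectorSpecific": {"owner": "bob"}}` would resolve `owner` from the nested section because the field is absent at the top level.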


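To make the Tesseract path, Tesseract timeout, and Image DPI settings listed in the steps above more concrete, the following Python sketch shows how a process could call the Tesseract command-line tool with a time limit. It is an illustration only, not how the component itself invokes Tesseract. The function names are hypothetical, and the `--dpi` option is assumed to be available (it is provided by Tesseract 4 and later).

```python
import subprocess
from typing import List, Optional


def build_tesseract_command(tesseract_path: str, image_path: str,
                            dpi: Optional[int] = None) -> List[str]:
    """Build the command line; 'stdout' as output base prints the recognized text."""
    command = [tesseract_path, image_path, "stdout"]
    if dpi is not None:
        command += ["--dpi", str(dpi)]
    return command


def run_ocr(tesseract_path: str, image_path: str, timeout_ms: int,
            dpi: Optional[int] = None) -> Optional[str]:
    """Run OCR on one image; return None if the process exceeds the timeout.

    subprocess.run kills the child process when the timeout expires.
    A non-zero exit code raises subprocess.CalledProcessError.
    """
    try:
        result = subprocess.run(
            build_tesseract_command(tesseract_path, image_path, dpi),
            capture_output=True,
            text=True,
            timeout=timeout_ms / 1000.0,
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

Because the timeout setting is expressed in milliseconds, the sketch converts it to seconds before passing it to `subprocess.run`.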