The Elasticsearch Cache Lookup is a workflow component for Aspire.


Elastic Cache Lookup
Factory Name: com.accenture.aspire:app-ocr-processor
subType: job-input
Inputs: The field from which you want to get the value and a field to be created in the Aspire document.
Outputs: Aspire object that contains a subjob with metadata and checked-out content extracted from the OCR process.


Configuration


This section lists all configuration parameters for the Tesseract OCR component.


| Section | Element | Type | Default | Description |
|---|---|---|---|---|
| OCR Settings | tesseractPath | text | - | Tesseract binary location |
| OCR Settings | processTimeout | number | 600000 | Time (in milliseconds) to wait before killing a Tesseract process |
| OCR Settings | imageDirectory | text | - | Directory used to store the temporary files generated during OCR |
| OCR Settings | maxSize | text | 10mb | Apply image correction only to images under this size (e.g. 250kb, 5mb, 1gb) |
| OCR Settings | confidenceThreshold | number | 80.0 | Minimum confidence value required to accept the OCR output |
| Image creation settings | outputFormat | select | jpg | Image format (jpg, png, tiff) |
| Image creation settings | imageType | select | bilevel | Image color scale (bilevel, gray, rgba, rgb) |
| Image creation settings | dpi | number | 300 | Image dots per inch |
| Mime Type settings | mimeTypeXPath | text | /doc/mimeType | XPath expression used to get the document MIME type |
| Mime Type settings | pdfMimeTypes | array | - | MIME types for PDF documents |
| Mime Type settings | imageMimeTypes | array | - | MIME types for image documents |
| Page splitter settings | startPage | number | 0 | Page at which OCR processing starts. If the value is 0, processing starts from the first page |
| Page splitter settings | endPage | number | 20 | Last page to process with OCR |
| Advanced settings | processThreads | number | 8 | Maximum number of threads used by the application |
| Advanced settings | processQueue | number | 30 | Size of the application process queue; should be at least 3 times processThreads |
| Advanced settings | backoffTime | number | 1000 | Time (in milliseconds) to wait before retrying to add a job to the queue when it is full |
| Advanced settings | debug | boolean | false | Check to enable debug messages |
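
The maxSize values in the table (250kb, 5mb, 1gb) are size strings rather than plain numbers. As a rough illustration of how such a value could be interpreted, the Python sketch below converts a size string to bytes and checks whether an image falls under the limit. The helper names `parse_size` and `within_max_size` are hypothetical, and the use of binary units (1 kb = 1024 bytes) and a strict "less than" comparison are assumptions; the component itself may behave differently.

```python
import re

# Assumption: binary units (1 kb = 1024 bytes). The component may use decimal units.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)\s*$", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Convert a size string such as '250kb' or '5mb' to a number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.lower()])


def within_max_size(image_bytes: int, max_size: str = "10mb") -> bool:
    """Return True if the image is strictly under max_size (assumed boundary rule)."""
    return image_bytes < parse_size(max_size)
```

With these assumptions, a 3 MB image would be eligible for image correction under the default maxSize of 10mb, while a 12 MB image would not.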

Example Configuration

{
	"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
	"processTimeout": 600000,
	"imageDirectory": "C:\\dev\\tempDir",
	"maxSize": "10mb",
	"confidenceThreshold": 80,
	"outputFormat": "png",
	"imageType": "bilevel",
	"dpi": 300,
	"mimeTypeXPath": "/doc/normalizedMimeType",
	"pdfMimeTypes": "aspire/pdf",
	"imageMimeTypes": "aspire/drawing",
	"startPage": 0,
	"endPage": 20,
	"processThreads": 8,
	"processQueue": 30,
	"backoffTime": 1000,
	"debug": true
}
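
Before deploying a configuration like the one above, it can help to check it against the constraints stated in the parameter table. The Python sketch below is one possible pre-flight check and is not part of Aspire; the function name `check_config` is hypothetical, and the fallback values are the defaults listed in the table. It verifies the allowed outputFormat and imageType values, the "processQueue at least 3 times processThreads" guideline, and that endPage is not before startPage (the last check is an assumption about sensible input, not a documented rule).

```python
import json
from typing import List

OUTPUT_FORMATS = {"jpg", "png", "tiff"}
IMAGE_TYPES = {"bilevel", "gray", "rgba", "rgb"}


def check_config(cfg: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if cfg.get("outputFormat", "jpg") not in OUTPUT_FORMATS:
        problems.append("outputFormat must be one of jpg, png, tiff")
    if cfg.get("imageType", "bilevel") not in IMAGE_TYPES:
        problems.append("imageType must be one of bilevel, gray, rgba, rgb")
    threads = cfg.get("processThreads", 8)
    queue = cfg.get("processQueue", 30)
    if queue < 3 * threads:
        problems.append("processQueue should be at least 3 times processThreads")
    # Assumption: a page range that ends before it starts is a mistake.
    if cfg.get("endPage", 20) < cfg.get("startPage", 0):
        problems.append("endPage should not be smaller than startPage")
    return problems


# A subset of the example configuration, parsed from JSON.
sample = json.loads(
    '{"outputFormat": "png", "imageType": "bilevel", '
    '"startPage": 0, "endPage": 20, "processThreads": 8, "processQueue": 30}'
)
print(check_config(sample))  # prints []
```

The example values pass because the queue size of 30 is larger than 3 × 8 = 24; lowering processQueue to 10 would produce one reported problem.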