The Tesseract OCR Component is a workflow component for Aspire.


Tesseract OCR
Factory Name: com.accenture.aspire:app-ocr-processor
subType: job-input
Inputs: The field from which you want to get the value and a field to be created in the Aspire document.
Outputs: Aspire object that contains a subjob with metadata and checked-out content extracted from the OCR process.

Configuration


This section lists all parameters available to configure the Tesseract OCR component.


| Section | Element | Type | Default | Description |
|---|---|---|---|---|
| OCR Settings | tesseractPath | text | - | Tesseract binary location |
| | processTimeout | number | 600000 | Time (in milliseconds) to wait before killing a Tesseract process |
| | imageDirectory | text | - | Directory used to store the temporary files generated during OCR |
| | maxSize | text | 10mb | Apply image correction only to images smaller than this size (e.g., 250kb, 5mb, 1gb) |
| | confidenceThreshold | number | 80.0 | Minimum confidence value to accept the OCR output |
| Image creation settings | outputFormat | select | jpg | Image format (jpg, png, tiff) |
| | imageType | select | bilevel | Image color scale (bilevel, gray, rgba, rgb) |
| | dpi | number | 300 | Image dots per inch |
| Mime Type settings | mimeTypeXPath | text | /doc/mimeType | XPath expression used to get the document MIME type |
| | pdfMimeTypes | array | - | MIME types for PDF documents |
| | imageMimeTypes | array | - | MIME types for image documents |
| Page splitter settings | startPage | number | 0 | Page at which OCR processing starts. A value of 0 starts from the first page. |
| | endPage | number | 20 | Last page to process with OCR |
| Advanced settings | processThreads | number | 8 | Maximum number of threads used by the application |
| | processQueue | number | 30 | Size of the application process queue; should be at least 3 times the number of process threads |
| | backoffTime | number | 1000 | Time (in milliseconds) to wait before trying to add a job to the queue when it is full |
| | debug | boolean | false | Check to enable debug messages |
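
To illustrate the maxSize format and the processQueue sizing guidance from the parameter list above, here is a minimal Python sketch. It is not the component's own code: the helper names parse_size and queue_is_large_enough are hypothetical, and the 1024-based interpretation of the kb/mb/gb suffixes is an assumption, since this page does not say how the component interprets them.

```python
import re

# Assumption: suffixes are 1024-based; the component may interpret them differently.
_UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size string such as '250kb', '5mb', or '1gb' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(kb|mb|gb)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def queue_is_large_enough(process_threads: int, process_queue: int) -> bool:
    """Check the documented guidance: queue size at least 3 times the thread count."""
    return process_queue >= 3 * process_threads
```

Under these assumptions, the default values (8 threads, queue size 30) satisfy the sizing guidance, because 30 is at least 3 x 8 = 24.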

Example Configuration

{
	"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
	"processTimeout": 600000,
	"imageDirectory": "C:\\dev\\tempDir",
	"maxSize": "10mb",
	"confidenceThreshold": 80,
	"outputFormat": "png",
	"imageType": "bilevel",
	"dpi": 300,
	"mimeTypeXPath": "/doc/normalizedMimeType",
	"pdfMimeTypes": "aspire/pdf",
	"imageMimeTypes": "aspire/drawing",
	"startPage": 0,
	"endPage": 20,
	"processThreads": 8,
	"processQueue": 30,
	"backoffTime": 1000,
	"debug": true
}
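
A configuration like the one above can be sanity-checked before it is deployed. The following Python sketch merges a JSON configuration with the defaults from the parameter list and reports any problems it finds. The function name check_config and the individual checks are illustrative assumptions, not part of Aspire; in particular, treating tesseractPath as required and rejecting an endPage lower than startPage are guesses based on the parameter descriptions.

```python
import json
from typing import List

# Defaults copied from the parameter list; parameters without a default are omitted.
DEFAULTS = {
    "processTimeout": 600000,
    "maxSize": "10mb",
    "confidenceThreshold": 80.0,
    "outputFormat": "jpg",
    "imageType": "bilevel",
    "dpi": 300,
    "mimeTypeXPath": "/doc/mimeType",
    "startPage": 0,
    "endPage": 20,
    "processThreads": 8,
    "processQueue": 30,
    "backoffTime": 1000,
    "debug": False,
}


def check_config(raw_json: str) -> List[str]:
    """Return a list of human-readable problems found in the configuration."""
    config = {**DEFAULTS, **json.loads(raw_json)}
    problems = []
    # Assumption: the binary location has no default, so it must be provided.
    if not config.get("tesseractPath"):
        problems.append("tesseractPath is not set")
    if str(config["outputFormat"]).lower() not in {"jpg", "png", "tiff"}:
        problems.append(f"unsupported outputFormat: {config['outputFormat']}")
    if config["imageType"] not in {"bilevel", "gray", "rgba", "rgb"}:
        problems.append(f"unsupported imageType: {config['imageType']}")
    # Documented guidance: the queue should hold at least 3 jobs per thread.
    if config["processQueue"] < 3 * config["processThreads"]:
        problems.append("processQueue is less than 3 times processThreads")
    # Assumption: the last page should not come before the first page.
    if config["endPage"] < config["startPage"]:
        problems.append("endPage is lower than startPage")
    return problems
```

Passing the example configuration above to check_config yields an empty list, since every value falls within the documented options and the queue size of 30 is at least 3 times the 8 process threads.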