The Tesseract OCR component can be configured using the Aspire REST API. It requires the following entities to be created:

The examples below show how to configure the component.


Create Workflow


NOTE: Some options in the following table are collapsed, or are displayed only when other options (such as a checkbox or a select list) are selected.

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
| properties | Yes | - | No | Configuration object | |
| tesseractPath | Yes | - | No | Full path where the Tesseract application is installed | C:\Tesseract-OCR\tesseract |
| processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
| imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
| maxSize | Yes | 10mb | No | Apply image correction only to images smaller than this size (e.g., 250kb, 5mb, 1gb) | 10mb |
| confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output | 80.0 |
| outputFormat | Yes | jpg | No | Image format of the output | png |
| imageType | Yes | bilevel | No | Image color scale of the output | bilevel |
| dpi | Yes | 300 | No | Image dots per inch of the output | 300 |
| mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression used to get the document MIME type | /doc/normalizedMimeType |
| pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
| imageMimeTypes | Yes | - | Yes | MIME type for image documents | aspire/drawing |
| startPage | Yes | 0 | No | Page to start processing with OCR. If the value is 0, processing starts from the first page. | 0 |
| endPage | Yes | 20 | No | Last page to process with OCR | 20 |
| processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
| processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the number of process threads | 30 |
| backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
| debug | No | false | No | Enables debug messages. | false |
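
The constraints listed in the table can be checked before a request is sent. The following is a minimal sketch in Python and is not part of Aspire: `REQUIRED_KEYS` and `check_properties` are hypothetical names, and the only rules encoded are the ones stated in the table (the fields marked as required and the queue size being at least 3 times the thread count).

```python
# Hypothetical pre-flight check; these names are illustrative, not an Aspire API.
# The list holds the entries of "properties" that the table marks as required.
REQUIRED_KEYS = [
    "tesseractPath", "processTimeout", "imageDirectory", "maxSize",
    "confidenceThreshold", "outputFormat", "imageType", "dpi",
    "mimeTypeXPath", "pdfMimeTypes", "imageMimeTypes", "startPage",
    "endPage", "processThreads", "processQueue", "backoffTime",
]


def check_properties(props):
    """Return a list of problems found in a properties dict (empty if none)."""
    problems = [
        f"missing required property: {key}"
        for key in REQUIRED_KEYS
        if key not in props
    ]
    threads = props.get("processThreads")
    queue = props.get("processQueue")
    # The table recommends a queue at least 3 times the number of threads.
    if threads is not None and queue is not None and queue < 3 * threads:
        problems.append(
            f"processQueue ({queue}) should be at least 3 x processThreads ({threads})"
        )
    return problems
```

With the documented defaults (8 threads, queue size 30) the ratio check passes, because 30 is greater than 3 x 8 = 24.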

Example 

NOTE: The properties in the following structure are not ordered by the configuration sections shown on the Tesseract OCR Component - App Bundle page.

PUT aspire/_api/workflows/9bdf3efb-c266-46ac-ab59-eb8eda87d9e9/rules
{     
	"type": "application",
	"appName": "Tesseract Ocr",
	"appType": "tesseract-ocr",
	"config": "com.accenture.aspire:app-ocr-processor",
	"description": "tesseract-ocr",
	"properties": {
		"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
		"processTimeout": 600000,
		"imageDirectory": "C:\\tempDir",
		"maxSize": "10mb",
		"confidenceThreshold": 80,
		"outputFormat": "png",
		"imageType": "bilevel",
		"dpi": 300,
		"mimeTypeXPath": "/doc/normalizedMimeType",
		"pdfMimeTypes": "aspire/pdf",
		"imageMimeTypes": "aspire/drawing",
		"startPage": 0,
		"endPage": 20,
		"processThreads": 8,
		"processQueue": 30,
		"backoffTime": 1000,
		"debug": true
	} 
}  
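
As an illustration of how the request above might be issued from a script, here is a minimal sketch using only the Python standard library. The base URL, the `Content-Type` header, and the helper name `build_rule_request` are assumptions that do not come from this page; authentication is not shown. The sketch only builds the request object and leaves sending it to the caller.

```python
import json
import urllib.request

# Assumption: placeholder address of the Aspire server; replace with your own.
BASE_URL = "http://localhost:50505"
# Workflow id copied from the example request above.
WORKFLOW_ID = "9bdf3efb-c266-46ac-ab59-eb8eda87d9e9"


def build_rule_request(body, base_url=BASE_URL, workflow_id=WORKFLOW_ID):
    """Build (but do not send) the PUT request for the workflow rules endpoint."""
    url = f"{base_url}/aspire/_api/workflows/{workflow_id}/rules"
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        # Assumption: the API expects a JSON content type.
        headers={"Content-Type": "application/json"},
    )


# Same body as the JSON example above, written as a Python dict.
body = {
    "type": "application",
    "appName": "Tesseract Ocr",
    "appType": "tesseract-ocr",
    "config": "com.accenture.aspire:app-ocr-processor",
    "description": "tesseract-ocr",
    "properties": {
        "tesseractPath": "C:\\Tesseract-OCR\\tesseract",
        "processTimeout": 600000,
        "imageDirectory": "C:\\tempDir",
        "maxSize": "10mb",
        "confidenceThreshold": 80,
        "outputFormat": "png",
        "imageType": "bilevel",
        "dpi": 300,
        "mimeTypeXPath": "/doc/normalizedMimeType",
        "pdfMimeTypes": "aspire/pdf",
        "imageMimeTypes": "aspire/drawing",
        "startPage": 0,
        "endPage": 20,
        "processThreads": 8,
        "processQueue": 30,
        "backoffTime": 1000,
        "debug": True,
    },
}

request = build_rule_request(body)
# To send it against a running server: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it possible to inspect the URL and payload before anything reaches the server.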

Update Workflow


| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
| properties | Yes | - | No | Configuration object | |
| tesseractPath | Yes | - | No | Full path where the Tesseract application is installed | C:\Tesseract-OCR\tesseract |
| processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
| imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
| maxSize | Yes | 10mb | No | Apply image correction only to images smaller than this size (e.g., 250kb, 5mb, 1gb) | 10mb |
| confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output | 80.0 |
| outputFormat | Yes | jpg | No | Image format of the output | png |
| imageType | Yes | bilevel | No | Image color scale of the output | bilevel |
| dpi | Yes | 300 | No | Image dots per inch of the output | 300 |
| mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression used to get the document MIME type | /doc/normalizedMimeType |
| pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
| imageMimeTypes | Yes | - | Yes | MIME type for image documents | aspire/drawing |
| startPage | Yes | 0 | No | Page to start processing with OCR. If the value is 0, processing starts from the first page. | 0 |
| endPage | Yes | 20 | No | Last page to process with OCR | 20 |
| processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
| processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the number of process threads | 30 |
| backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
| debug | No | false | No | Enables debug messages. | false |
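
The `maxSize` field takes a size string such as 250kb, 5mb, or 1gb. The sketch below shows one way such a value could be interpreted when preparing or reviewing a configuration. It is not the component's own parser: `parse_size` is a hypothetical helper, and the use of binary multiples (1kb = 1024 bytes) is an assumption.

```python
import re

# Assumption: binary multiples; the component itself may use decimal ones.
_UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '10mb' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(kb|mb|gb)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

Under that assumption, `parse_size("10mb")` returns 10485760.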

Example 

{     
	"type": "application",
	"appName": "Tesseract Ocr",
	"appType": "tesseract-ocr",
	"config": "com.accenture.aspire:app-ocr-processor",
	"description": "tesseract-ocr",
	"properties": {
		"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
		"processTimeout": 600000,
		"imageDirectory": "C:\\tempDir",
		"maxSize": "10mb",
		"confidenceThreshold": 80,
		"outputFormat": "png",
		"imageType": "bilevel",
		"dpi": 300,
		"mimeTypeXPath": "/doc/normalizedMimeType",
		"pdfMimeTypes": "aspire/pdf",
		"imageMimeTypes": "aspire/drawing",
		"startPage": 0,
		"endPage": 20,
		"processThreads": 8,
		"processQueue": 30,
		"backoffTime": 1000,
		"debug": true
	} 
}  
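
When reviewing an update, it can help to see which properties differ from the defaults in the field table. The following is a minimal sketch; `DEFAULTS` is copied from the Default column of the table (fields without a default are omitted), and `non_default` is a hypothetical helper name.

```python
# Defaults copied from the field table; fields whose default is "-" are omitted.
DEFAULTS = {
    "processTimeout": 600000, "maxSize": "10mb", "confidenceThreshold": 80.0,
    "outputFormat": "jpg", "imageType": "bilevel", "dpi": 300,
    "mimeTypeXPath": "/doc/mimeType", "startPage": 0, "endPage": 20,
    "processThreads": 8, "processQueue": 30, "backoffTime": 1000,
    "debug": False,
}


def non_default(props):
    """Return the properties whose value differs from the documented default."""
    return {
        key: value
        for key, value in props.items()
        if key in DEFAULTS and value != DEFAULTS[key]
    }


# The "properties" object from the example above.
example_properties = {
    "tesseractPath": "C:\\Tesseract-OCR\\tesseract",
    "processTimeout": 600000,
    "imageDirectory": "C:\\tempDir",
    "maxSize": "10mb",
    "confidenceThreshold": 80,
    "outputFormat": "png",
    "imageType": "bilevel",
    "dpi": 300,
    "mimeTypeXPath": "/doc/normalizedMimeType",
    "pdfMimeTypes": "aspire/pdf",
    "imageMimeTypes": "aspire/drawing",
    "startPage": 0,
    "endPage": 20,
    "processThreads": 8,
    "processQueue": 30,
    "backoffTime": 1000,
    "debug": True,
}

print(non_default(example_properties))
# {'outputFormat': 'png', 'mimeTypeXPath': '/doc/normalizedMimeType', 'debug': True}
```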