The Tesseract OCR component can be configured using the Aspire workflow section. It requires the entities described below to be created.

The examples below show how to configure the component.


Create Workflow


NOTE: Some options in the following table are collapsed or displayed only when other options, such as a checkbox or a select, are chosen.

| Field | Required | Default | Multiple | Notes | Example |
| --- | --- | --- | --- | --- | --- |
| description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
| properties | | | | Configuration object | |
| tesseractPath | Yes | - | No | Full path to the installed Tesseract executable | C:\Tesseract-OCR\tesseract |
| processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
| imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
| maxSize | Yes | 10mb | No | Apply image correction only to images smaller than this size (e.g. 250kb, 5mb, 1gb) | 10mb |
| confidenceThreshold | Yes | 80.0 | No | Minimum confidence value required to accept the OCR output | 80.0 |
| outputFormat | Yes | - | No | Image format of the output | png |
| imageType | Yes | - | No | Image color scale of the output | bilevel |
| dpi | Yes | 300 | No | Resolution of the output image, in dots per inch | 300 |
| mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression used to get the document MIME type | /doc/normalizedMimeType |
| pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
| imageMimeTypes | Yes | - | No | MIME type for image documents | aspire/drawing |
| startPage | Yes | 0 | No | First page to process with OCR. A value of 0 starts from the first page. | 0 |
| endPage | Yes | 20 | No | Last page to process with OCR | 20 |
| processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
| processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the process threads | 30 |
| backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
| debug | No | false | No | Enables debug messages. | false |
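
Two of the constraints in the table can be checked before a configuration is submitted: processQueue should be at least 3 times processThreads, and maxSize is a size string such as 250kb or 10mb. The following Python sketch is a hypothetical client-side check, not part of Aspire. The helper names parse_size and check_properties are made up for illustration, and the sketch assumes binary units (1kb = 1024 bytes), which this page does not specify.

```python
import re

# Assumption: binary units (1kb = 1024 bytes); the page does not say which is used.
UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '250kb' or '10mb' into a number of bytes."""
    match = re.fullmatch(r"(\d+)(kb|mb|gb)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size string: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def check_properties(props):
    """Return a list of problems found in a 'properties' dictionary."""
    problems = []
    # Rule from the table: the queue should hold at least 3 jobs per thread.
    if props["processQueue"] < 3 * props["processThreads"]:
        problems.append("processQueue should be at least 3 times processThreads")
    try:
        parse_size(props["maxSize"])
    except ValueError as error:
        problems.append(str(error))
    return problems


example = {"processThreads": 8, "processQueue": 30, "maxSize": "10mb"}
print(check_properties(example))  # the values from the table pass both checks
```

Running a check like this locally gives quicker feedback than waiting for the component to reject or misbehave on a bad value.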

Example 

NOTE: The following structure is not ordered by the sections of the component configuration, as found on the Elastic Cache Lookup App Bundle page

PUT aspire/_api/credentials/2a5ca234-e328-4d40-bb2a-2df3e550b065
{     
	"type": "application",
	"_type": "application",
	"appName": "Tesseract Ocr",
	"appType": "tesseract-ocr",
	"config": "com.accenture.aspire:app-ocr-processor",
	"description": "tesseract-ocr",
	"properties": {
		"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
		"processTimeout": 600000,
		"imageDirectory": "C:\\dev\\tempDir",
		"maxSize": "10mb",
		"confidenceThreshold": 80,
		"outputFormat": "png",
		"imageType": "bilevel",
		"dpi": 300,
		"mimeTypeXPath": "/doc/normalizedMimeType",
		"pdfMimeTypes": "aspire/pdf",
		"imageMimeTypes": "aspire/drawing",
		"startPage": 0,
		"endPage": 20,
		"processThreads": 8,
		"processQueue": 30,
		"backoffTime": 1000,
		"debug": true
	} 
}
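
As a rough illustration, the request above could be prepared from Python with the standard library. This is a sketch under assumptions: the base URL http://localhost:50505 is a placeholder for your Aspire server, build_put_request is a hypothetical helper, and the properties object is abbreviated. The request is only built here, not sent.

```python
import json
import urllib.request

# Placeholder: replace with the address of your Aspire server.
BASE_URL = "http://localhost:50505"


def build_put_request(path, body):
    """Build (but do not send) a JSON PUT request for the given path."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


body = {
    "type": "application",
    "appName": "Tesseract Ocr",
    "appType": "tesseract-ocr",
    "config": "com.accenture.aspire:app-ocr-processor",
    "description": "tesseract-ocr",
    # Abbreviated: include every required property from the table.
    "properties": {
        "tesseractPath": "C:\\Tesseract-OCR\\tesseract",
        "imageDirectory": "C:\\dev\\tempDir",
    },
}

request = build_put_request(
    "aspire/_api/credentials/2a5ca234-e328-4d40-bb2a-2df3e550b065", body
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

Serializing with json.dumps takes care of escaping the backslashes in Windows paths, which is easy to get wrong when the JSON is written by hand.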

Update Workflow



| Field | Required | Default | Multiple | Notes | Example |
| --- | --- | --- | --- | --- | --- |
| description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
| properties | | | | Configuration object | |
| tesseractPath | Yes | - | No | Full path to the installed Tesseract executable | C:\Tesseract-OCR\tesseract |
| processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
| imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
| maxSize | Yes | 10mb | No | Apply image correction only to images smaller than this size (e.g. 250kb, 5mb, 1gb) | 10mb |
| confidenceThreshold | Yes | 80.0 | No | Minimum confidence value required to accept the OCR output | 80.0 |
| outputFormat | Yes | - | No | Image format of the output | png |
| imageType | Yes | - | No | Image color scale of the output | bilevel |
| dpi | Yes | 300 | No | Resolution of the output image, in dots per inch | 300 |
| mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression used to get the document MIME type | /doc/normalizedMimeType |
| pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
| imageMimeTypes | Yes | - | No | MIME type for image documents | aspire/drawing |
| startPage | Yes | 0 | No | First page to process with OCR. A value of 0 starts from the first page. | 0 |
| endPage | Yes | 20 | No | Last page to process with OCR | 20 |
| processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
| processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the process threads | 30 |
| backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
| debug | No | false | No | Enables debug messages. | false |

Example

{  	
	"description": "tesseract-ocr",
	"properties": {
		"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
		"processTimeout": 600000,
		"imageDirectory": "C:\\dev\\tempDir",
		"maxSize": "10mb",
		"confidenceThreshold": 80,
		"outputFormat": "png",
		"imageType": "bilevel",
		"dpi": 300,
		"mimeTypeXPath": "/doc/normalizedMimeType",
		"pdfMimeTypes": "aspire/pdf",
		"imageMimeTypes": "aspire/drawing",
		"startPage": 0,
		"endPage": 20,
		"processThreads": 8,
		"processQueue": 30,
		"backoffTime": 1000,
		"debug": true
	} 
} 
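
An update body like the one above can be derived from an existing configuration by overriding only the values that change. The sketch below is illustrative: build_update_body is a hypothetical helper, and it resends the full properties object because this page does not say whether partial updates are accepted.

```python
import copy
import json


def build_update_body(description, current_properties, changes):
    """Return an update body with `changes` applied over the current properties."""
    unknown = set(changes) - set(current_properties)
    if unknown:
        # Guard against typos in field names such as 'procesThreads'.
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    # Copy first so the caller's dictionary is left untouched.
    properties = copy.deepcopy(current_properties)
    properties.update(changes)
    return {"description": description, "properties": properties}


current = {"processThreads": 8, "processQueue": 30, "debug": True}  # abbreviated
body = build_update_body("tesseract-ocr", current, {"debug": False})
print(json.dumps(body, indent=2))
```

Rejecting unknown keys is a design choice made for this sketch; it catches misspelled field names before they reach the server.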