The Tesseract OCR component can be configured using the Aspire workflow section. It requires the following entities to be created.

Below are examples of how to configure the component.



Create Workflow


NOTE: Some options in the following table are collapsed or displayed only when other options, such as checkboxes or select lists, are selected.

| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
| properties | | | | Configuration object | |
| tesseractPath | Yes | - | No | Full path where the Tesseract application is installed | C:\Tesseract-OCR\tesseract |
| processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
| imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
| maxSize | Yes | 10mb | No | Apply image correction only to images that fall under this size (e.g. 250kb, 5mb, 1gb) | 10mb |
| confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output | 80.0 |
| outputFormat | Yes | - | No | Image format of the output | png |
| imageType | Yes | - | No | Image color scale of the output | bilevel |
| dpi | Yes | 300 | No | Image dots per inch of the output | 300 |
| mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression to get the document MIME type | /doc/normalizedMimeType |
| pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
| imageMimeTypes | Yes | - | No | MIME type for image documents | aspire/drawing |
| startPage | Yes | 0 | No | Page to start processing with OCR. If the value is 0, processing starts from the first page | 0 |
| endPage | Yes | 20 | No | Last page to process with OCR | 20 |
| processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
| processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the process threads | 30 |
| backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
| debug | No | false | No | Enables debug messages. | FALSE |
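
The constraints listed in the table can be checked before a configuration is submitted. The following is a minimal sketch and not part of Aspire: the helper name `check_properties` and the decision to report problems as a list of strings are assumptions made for illustration. The field names and the "at least 3 times" rule for `processQueue` are taken from the table.

```python
# Hypothetical helper, not part of Aspire: checks a "properties" object
# against the Required column and the processQueue note in the table.
REQUIRED_FIELDS = [
    "tesseractPath", "processTimeout", "imageDirectory", "maxSize",
    "confidenceThreshold", "outputFormat", "imageType", "dpi",
    "mimeTypeXPath", "pdfMimeTypes", "imageMimeTypes", "startPage",
    "endPage", "processThreads", "processQueue", "backoffTime",
]


def check_properties(properties):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in properties
    ]

    threads = properties.get("processThreads")
    queue = properties.get("processQueue")
    # The table recommends a queue at least 3 times the thread count.
    if isinstance(threads, int) and isinstance(queue, int) and queue < 3 * threads:
        problems.append(
            f"processQueue ({queue}) is smaller than 3 x processThreads ({3 * threads})"
        )
    return problems


if __name__ == "__main__":
    # Only the queue rule is interesting here; the missing fields are reported too.
    for problem in check_properties({"processThreads": 8, "processQueue": 16}):
        print(problem)
```

With the documented defaults (8 threads, queue size 30) the queue rule passes, since 30 is at least 24.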

Example 

NOTE: The following structure is not ordered by the sections of the component configuration, as found on the Elastic Cache Lookup App Bundle page

PUT aspire/_api/credentials/2a5ca234-e328-4d40-bb2a-2df3e550b065
{     
	"type": "application",
	"_type": "application",
	"appName": "Tesseract Ocr",
	"appType": "tesseract-ocr",
	"config": "com.accenture.aspire:app-ocr-processor",
	"description": "tesseract-ocr",
	"properties": {
		"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
		"processTimeout": 600000,
		"imageDirectory": "C:\\dev\\tempDir",
		"maxSize": "10mb",
		"confidenceThreshold": 80,
		"outputFormat": "png",
		"imageType": "bilevel",
		"dpi": 300,
		"mimeTypeXPath": "/doc/normalizedMimeType",
		"pdfMimeTypes": "aspire/pdf",
		"imageMimeTypes": "aspire/drawing",
		"startPage": 0,
		"endPage": 20,
		"processThreads": 8,
		"processQueue": 30,
		"backoffTime": 1000,
		"debug": true
	} 
}
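
The Windows paths in the table use single backslashes, while the JSON body doubles them, because a backslash has to be escaped inside a JSON string. When the body is generated by a script, a JSON serializer takes care of this. The sketch below illustrates the idea; `build_create_body` is a hypothetical helper, and only a subset of the fields from the example is included for brevity.

```python
import json


def build_create_body(tesseract_path, image_directory):
    """Assemble part of a create body; key names mirror the example above."""
    return {
        "type": "application",
        "appName": "Tesseract Ocr",
        "appType": "tesseract-ocr",
        "description": "tesseract-ocr",
        "properties": {
            # Paths are passed as they appear on disk (single backslashes).
            "tesseractPath": tesseract_path,
            "imageDirectory": image_directory,
        },
    }


# Raw strings keep the backslashes literal on the Python side.
body = build_create_body(r"C:\Tesseract-OCR\tesseract", r"C:\dev\tempDir")

# json.dumps escapes each backslash, producing the doubled form seen in the example.
text = json.dumps(body, indent=2)
print(text)
```

Parsing the serialized text again yields the original single-backslash paths, so the doubling exists only in the JSON representation.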

Update Workflow



| Field | Required | Default | Multiple | Notes | Example |
|---|---|---|---|---|---|
| description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
| properties | Yes | - | No | Configuration object | |
| tesseractPath | Yes | - | No | Full path where the Tesseract application is installed | C:\Tesseract-OCR\tesseract |
| processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
| imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
| maxSize | Yes | 10mb | No | Apply image correction only to images that fall under this size (e.g. 250kb, 5mb, 1gb) | 10mb |
| confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output | 80.0 |
| outputFormat | Yes | - | No | Image format of the output | png |
| imageType | Yes | - | No | Image color scale of the output | bilevel |
| dpi | Yes | 300 | No | Image dots per inch of the output | 300 |
| mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression to get the document MIME type | /doc/normalizedMimeType |
| pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
| imageMimeTypes | Yes | - | No | MIME type for image documents | aspire/drawing |
| startPage | Yes | 0 | No | Page to start processing with OCR. If the value is 0, processing starts from the first page | 0 |
| endPage | Yes | 20 | No | Last page to process with OCR | 20 |
| processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
| processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the process threads | 30 |
| backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
| debug | No | false | No | Enables debug messages. | FALSE |
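
The `maxSize` notes give values such as 250kb, 5mb and 1gb, but do not state how the component interprets them. The sketch below shows one plausible reading. It assumes binary multiples (1kb = 1024 bytes), case-insensitive units, and that an image exactly at the limit still qualifies; all three are assumptions, and `parse_size` and `needs_correction` are hypothetical helpers rather than Aspire functions.

```python
import re

# Assumption: binary multiples; the component may use decimal ones instead.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}
_SIZE_RE = re.compile(r"(\d+)\s*(b|kb|mb|gb)", re.IGNORECASE)


def parse_size(text):
    """Convert a size string such as '10mb' into a number of bytes."""
    match = _SIZE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def needs_correction(image_bytes, max_size="10mb"):
    """True if the image is small enough for image correction to apply.

    Assumption: the limit itself is inclusive.
    """
    return image_bytes <= parse_size(max_size)


if __name__ == "__main__":
    print(parse_size("10mb"))
    print(needs_correction(2 * 1024 * 1024))
```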

Example

{  	
	"description": "tesseract-ocr",
	"properties": {
		"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
		"processTimeout": 600000,
		"imageDirectory": "C:\\dev\\tempDir",
		"maxSize": "10mb",
		"confidenceThreshold": 80,
		"outputFormat": "png",
		"imageType": "bilevel",
		"dpi": 300,
		"mimeTypeXPath": "/doc/normalizedMimeType",
		"pdfMimeTypes": "aspire/pdf",
		"imageMimeTypes": "aspire/drawing",
		"startPage": 0,
		"endPage": 20,
		"processThreads": 8,
		"processQueue": 30,
		"backoffTime": 1000,
		"debug": true
	} 
}
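
Comparing the two examples, the update body carries only `description` and `properties`, while the create body also includes `type`, `_type`, `appName`, `appType` and `config`. If a create body is already at hand, an update body can be derived from it. The following is a minimal sketch, assuming those two keys are all an update needs, as the example above suggests; `to_update_body` is an illustrative name, not an Aspire function.

```python
import copy

# Keys present in the update example above.
UPDATE_KEYS = ("description", "properties")


def to_update_body(create_body, **property_overrides):
    """Keep only the update keys and apply optional property changes."""
    body = {
        key: copy.deepcopy(create_body[key])
        for key in UPDATE_KEYS
        if key in create_body
    }
    # Overrides are applied to a copy, so the create body stays untouched.
    body.setdefault("properties", {}).update(property_overrides)
    return body


if __name__ == "__main__":
    create_body = {
        "type": "application",
        "appName": "Tesseract Ocr",
        "description": "tesseract-ocr",
        "properties": {"dpi": 300, "debug": True},
    }
    print(to_update_body(create_body, debug=False))
```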