Comment: Reverted from v. 10

The Elastic Cache Lookup component can be configured using the Aspire workflow section. It requires the following entities to be created.

The examples below show how to configure the component.



Create Workflow


NOTE: Some options in the following table collapse or are displayed only when other options, such as a checkbox or a select, are chosen.

| Field | Required | Default | Multiple | Notes | Example |
| --- | --- | --- | --- | --- | --- |
| description | Yes | - | No | Name of the component application. | "Elastic Cache Lookup" |
| properties | Yes | - | No | Configuration object | |
| Server url | Yes | - | No | Complete URL where the feeds will be sent. | http://localhost:9200/bulk_ |
| Authentication | No | None | Yes | User with the permissions to read from the Elastic index specified. | none, basic, aws |
| Index | Yes | - | No | The elastic index to crawl. Index name limitations: 1) Lowercase only. 2) Cannot include \\, \/, ?, \", <, >, \|, (space character), ,, # 3) Cannot start with -, _, + 4) | [{"index":"index1"}] |
| Idle connection timeout | Yes | 3600000 | No | Maximum time (in milliseconds) to keep an idle connection open. | 3600000 |
| Max connections | Yes | 100 | No | Maximum number of connections to be opened. | 100 |
| Connections per target | Yes | 10 | No | Maximum number of connections opened for the same target. | 10 |
| Connection timeout | Yes | 15000 | No | Maximum time (in milliseconds) to wait for the connection. | 15000 |
| Socket timeout | Yes | 15000 | No | Maximum time (in milliseconds) to wait for a socket response. | 15000 |
| Throttling period | Yes | 5000 | No | Time period (in milliseconds) to throttle the connection. | 5000 |
| Max connections per period | Yes | 500 | No | Maximum number of connections used during the throttling period. | 500 |
| Maximum retries | Yes | 3 | No | Maximum number of retries for a failed document. | 3 |
| Retry delay | Yes | 5000 | No | Time (in milliseconds) to wait before a retry. | 5000 |
| Max number of entries | No | 1000 | No | Maximum total number of entries to keep in the cache. | 1000 |
| Max Total Weight (MB) | No | 500 | No | Maximum total weight of the entries the cache may contain. | 500 |
| Time (min) | No | 5 | No | Remove records that have been idle for this amount of time, in minutes. | 5 |
| Index lookup field | Yes | - | No | Elastic index field name for the lookup. | [{"index":"index1"}] |
| Source lookup field | Yes | - | No | Field name from the incoming AspireObject to use for the lookup. The field is searched first in 'doc' and then in the 'doc.connectorSpecific' section. | myid |
| Uppercase the source lookup field value | No | true | No | Convert the value of the source field to UPPERCASE. | FALSE |
| Lookup output field | Yes | - | No | Output fields from the lookup will be placed under this configured object. | myidOutput |
| Debug | No | false | No | Enables debug messages. | FALSE |
| Hit size | No | 1000 | No | Maximum number of hits returned by the cache lookup. If -1, all hits are returned. | 1000 |
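The index name limitations listed in the Index row can be checked before a configuration is submitted. The following is a minimal sketch, not part of Aspire: the function name `index_name_problems` is hypothetical, and it covers only rules 1 to 3, because rule 4 is cut off in the table.

```python
# Characters the Index row lists as not allowed
# (the space is listed there as "(space character)").
FORBIDDEN_CHARS = set('\\/?"<>| ,#')
FORBIDDEN_PREFIXES = ("-", "_", "+")


def index_name_problems(name: str) -> list[str]:
    """Return the listed rules that `name` violates; an empty list means none."""
    problems = []
    # Rule 1: lowercase only.
    if name != name.lower():
        problems.append("must be lowercase")
    # Rule 2: none of the forbidden characters may appear.
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("contains forbidden characters: " + " ".join(repr(c) for c in bad))
    # Rule 3: must not start with -, _ or +.
    if name.startswith(FORBIDDEN_PREFIXES):
        problems.append("cannot start with -, _ or +")
    return problems


print(index_name_problems("index1"))
print(index_name_problems("_My Index"))
```

The first call reports no problems, while the second name violates all three rules that the sketch checks.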

Example 

NOTE: The following structure does not follow the order of the sections in the component configuration, as found on the Elastic Cache Lookup App Bundle page.

PUT aspire/_api/credentials/2a5ca234-e328-4d40-bb2a-2df3e550b065
{
    "description": "Elastic Cache Lookup",
    "properties": {
        "url": "http://localhost:9200",
        "authType": "none",
        "index": "index_name",
        "idleConnectionTimeout": 3600000,
        "maxConnections": 100,
        "maxConnectionsPerRoute": 10,
        "connectionTimeout": 15000,
        "socketTimeout": 15000,
        "useThrottling": false,
        "maxRetries": 3,
        "retryWaitTime": 5000,
        "cache": true,
        "eviction": "size",
        "evictionMaxSize": 1000,
        "esIndexLookupField": "indexNaame",
        "sourceLookupField": "myid",
        "sourceLookupFieldToUpperCase": false,
        "lookupOutputField": "myidOutput",
        "debug": false,
        "size": 1000
    }
}
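Before sending a payload like the one above, it can help to confirm that the required fields are present. The sketch below is an illustration only: the helper `missing_keys` is hypothetical, and pairing the JSON keys from the example with the "Required" rows of the table is an assumption, since the table lists display labels and not key names.

```python
import json

# Keys taken from the example payload; pairing them with the "Required" rows
# of the table is an assumption made for this sketch.
REQUIRED_PROPERTIES = (
    "url", "index", "idleConnectionTimeout", "maxConnections",
    "maxConnectionsPerRoute", "connectionTimeout", "socketTimeout",
    "maxRetries", "retryWaitTime", "esIndexLookupField",
    "sourceLookupField", "lookupOutputField",
)


def missing_keys(payload: dict) -> list[str]:
    """List required keys that are absent from a create/update payload."""
    missing = [k for k in ("description", "properties") if k not in payload]
    properties = payload.get("properties", {})
    missing += ["properties." + k for k in REQUIRED_PROPERTIES if k not in properties]
    return missing


# A deliberately incomplete payload to show what the check reports.
payload = json.loads("""
{
    "description": "Elastic Cache Lookup",
    "properties": {
        "url": "http://localhost:9200",
        "index": "index_name",
        "sourceLookupField": "myid"
    }
}
""")
print(missing_keys(payload))
```

An empty list means every key the sketch knows about is present. It does not validate value types or ranges.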

Update Workflow


| Field | Required | Default | Multiple | Notes | Example |
| --- | --- | --- | --- | --- | --- |
| description | Yes | - | No | Name of the component application. | "Elastic Cache Lookup" |
| properties | Yes | - | No | Configuration object | |
| Server url | Yes | - | No | Complete URL where the feeds will be sent. | http://localhost:9200/bulk_ |
| Authentication | No | None | Yes | User with the permissions to read from the Elastic index specified. | none, basic, aws |
| Index | Yes | - | No | The elastic index to crawl. Index name limitations: 1) Lowercase only. 2) Cannot include \\, \/, ?, \", <, >, \|, (space character), ,, # 3) Cannot start with -, _, + 4) | [{"index":"index1"}] |
| Idle connection timeout | Yes | 3600000 | No | Maximum time (in milliseconds) to keep an idle connection open. | 3600000 |
| Max connections | Yes | 100 | No | Maximum number of connections to be opened. | 100 |
| Connections per target | Yes | 10 | No | Maximum number of connections opened for the same target. | 10 |
| Connection timeout | Yes | 15000 | No | Maximum time (in milliseconds) to wait for the connection. | 15000 |
| Socket timeout | Yes | 15000 | No | Maximum time (in milliseconds) to wait for a socket response. | 15000 |
| Throttling period | Yes | 5000 | No | Time period (in milliseconds) to throttle the connection. | 5000 |
| Max connections per period | Yes | 500 | No | Maximum number of connections used during the throttling period. | 500 |
| Maximum retries | Yes | 3 | No | Maximum number of retries for a failed document. | 3 |
| Retry delay | Yes | 5000 | No | Time (in milliseconds) to wait before a retry. | 5000 |
| Max number of entries | No | 1000 | No | Maximum total number of entries to keep in the cache. | 1000 |
| Max Total Weight (MB) | No | 500 | No | Maximum total weight of the entries the cache may contain. | 500 |
| Time (min) | No | 5 | No | Remove records that have been idle for this amount of time, in minutes. | 5 |
| Index lookup field | Yes | - | No | Elastic index field name for the lookup. | [{"index":"index1"}] |
| Source lookup field | Yes | - | No | Field name from the incoming AspireObject to use for the lookup. The field is searched first in 'doc' and then in the 'doc.connectorSpecific' section. | myid |
| Uppercase the source lookup field value | No | true | No | Convert the value of the source field to UPPERCASE. | TRUE |
| Lookup output field | Yes | - | No | Output fields from the lookup will be placed under this configured object. | myidOutput |
| Debug | No | false | No | Enables debug messages. | TRUE |
| Hit size | No | 1000 | No | Maximum number of hits returned by the cache lookup. If -1, all hits are returned. | 1000 |
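The search order described for the source lookup field, first 'doc' and then 'doc.connectorSpecific', together with the optional uppercase conversion, can be sketched as follows. This is an illustration under assumptions and not the component's implementation: a plain dictionary stands in for the AspireObject, and `resolve_source_value` is a hypothetical helper.

```python
from typing import Optional


def resolve_source_value(doc: dict, field: str, to_upper: bool = True) -> Optional[str]:
    """Find `field` in `doc`, then in `doc["connectorSpecific"]`; optionally uppercase it."""
    # First look at the top level of the document.
    value = doc.get(field)
    # Fall back to the connectorSpecific section if the field was not found.
    if value is None:
        value = doc.get("connectorSpecific", {}).get(field)
    if value is None:
        return None
    text = str(value)
    return text.upper() if to_upper else text


doc = {"title": "report", "connectorSpecific": {"myid": "ab-12"}}
print(resolve_source_value(doc, "myid"))                  # AB-12
print(resolve_source_value(doc, "myid", to_upper=False))  # ab-12
print(resolve_source_value(doc, "missing"))               # None
```

The `to_upper` default of `True` mirrors the default listed for "Uppercase the source lookup field value".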

Example

{
    "description": "Elastic Cache Lookup",
    "properties": {
        "url": "http://localhost:9200",
        "authType": "none",
        "index": "index_name",
        "idleConnectionTimeout": 3600000,
        "maxConnections": 100,
        "maxConnectionsPerRoute": 10,
        "connectionTimeout": 15000,
        "socketTimeout": 15000,
        "useThrottling": true,
        "maxRetries": 3,
        "retryWaitTime": 5000,
        "cache": true,
        "eviction": "size",
        "evictionMaxSize": 1000,
        "esIndexLookupField": "indexNaame",
        "sourceLookupField": "myid",
        "sourceLookupFieldToUpperCase": true,
        "lookupOutputField": "myidOutput",
        "debug": true,
        "size": 1000
    }
}