The Tesseract OCR component can be configured using the Aspire workflow section. It requires the following entities to be created.

Below are examples of how to configure the component.



Create Workflow


NOTE: Some options in the following table collapse or are displayed only when other options, such as a checkbox or a select, are selected.

| Field | Required | Default | Multiple | Notes | Example |
| description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
| properties | Yes | - | No | Configuration object. | |
| tesseractPath | Yes | - | No | Complete path where the tesseract application is installed. | C:\Tesseract-OCR\tesseract |
| processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process. | 600000 |
| imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR. | C:\tempDir |
| maxSize | Yes | 10mb | No | Apply image correction only to images under this size (e.g. 250kb, 5mb, 1gb). | 10mb |
| confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output. | 80.0 |
| outputFormat | Yes | jpg | No | Image format of the output. | png |
| imageType | Yes | bilevel | No | Image color scale of the output. | bilevel |
| dpi | Yes | 300 | No | Image dots per inch of the output. | 300 |
| mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression to get the document MIME type. | /doc/normalizedMimeType |
| pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents. | aspire/pdf |
| imageMimeTypes | Yes | - | Yes | MIME type for image documents. | aspire/drawing |
| startPage | Yes | 0 | No | Page to start processing with OCR. If the value is 0, processing starts from the first page. | 0 |
| endPage | Yes | 20 | No | Last page to process with OCR. | 20 |
| processThreads | Yes | 8 | No | Maximum number of threads used by the application. | 8 |
| processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the process threads. | 30 |
| backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full. | 1000 |
| debug | No | false | No | Enables debug messages. | false |
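
The maxSize field takes a size string such as 250kb, 5mb or 1gb. As a rough illustration of how such values could be interpreted, the Python sketch below converts them to a number of bytes. The helper name parse_size and the use of 1024-based units are assumptions made for this example only; how the component itself parses the value is not stated on this page.

```python
import re

# Assumed unit multipliers (1024-based); the component may use different ones.
_UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '10mb' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(kb|mb|gb)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


print(parse_size("10mb"))  # prints 10485760 under the 1024-based assumption
```

A value without a unit, such as "10", is rejected by this sketch, which is one reason to write the unit out explicitly in the configuration.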

Example 

NOTE: The following structure is not ordered by the sections of the component configuration, as found on the OCR Components App Bundle page.

PUT aspire/_api/credentials/2a5ca234-e328-4d40-bb2a-2df3e550b065
{
	"type": "application",
	"appName": "Tesseract Ocr",
	"appType": "tesseract-ocr",
	"config": "com.accenture.aspire:app-ocr-processor",
	"description": "tesseract-ocr",
	"properties": {
		"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
		"processTimeout": 600000,
		"imageDirectory": "C:\\tempDir",
		"maxSize": "10mb",
		"confidenceThreshold": 80,
		"outputFormat": "png",
		"imageType": "bilevel",
		"dpi": 300,
		"mimeTypeXPath": "/doc/normalizedMimeType",
		"pdfMimeTypes": "aspire/pdf",
		"imageMimeTypes": "aspire/drawing",
		"startPage": 0,
		"endPage": 20,
		"processThreads": 8,
		"processQueue": 30,
		"backoffTime": 1000,
		"debug": true
	}
}
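
Before sending a configuration like the one above, it can help to check it against the Create Workflow table. The Python sketch below is one possible pre-flight check, not part of Aspire: it reports required fields that are missing and flags a processQueue smaller than 3 times processThreads, as the table recommends. The function name check_properties and the list REQUIRED_FIELDS are hypothetical names introduced for this example.

```python
# Fields marked "Required: Yes" in the Create Workflow table.
REQUIRED_FIELDS = [
    "tesseractPath", "processTimeout", "imageDirectory", "maxSize",
    "confidenceThreshold", "outputFormat", "imageType", "dpi",
    "mimeTypeXPath", "pdfMimeTypes", "imageMimeTypes",
    "startPage", "endPage", "processThreads", "processQueue", "backoffTime",
]


def check_properties(props):
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if name not in props]
    threads = props.get("processThreads")
    queue = props.get("processQueue")
    # The table recommends a queue of at least 3 times the thread count.
    if threads is not None and queue is not None and queue < 3 * threads:
        problems.append(
            f"processQueue ({queue}) is below 3 x processThreads ({3 * threads})")
    return problems


# The "properties" object from the example on this page.
example = {
    "tesseractPath": "C:\\Tesseract-OCR\\tesseract",
    "processTimeout": 600000,
    "imageDirectory": "C:\\tempDir",
    "maxSize": "10mb",
    "confidenceThreshold": 80,
    "outputFormat": "png",
    "imageType": "bilevel",
    "dpi": 300,
    "mimeTypeXPath": "/doc/normalizedMimeType",
    "pdfMimeTypes": "aspire/pdf",
    "imageMimeTypes": "aspire/drawing",
    "startPage": 0,
    "endPage": 20,
    "processThreads": 8,
    "processQueue": 30,
    "backoffTime": 1000,
    "debug": True,
}

print(check_properties(example))  # prints []
```

With processThreads set to 8, any processQueue below 24 would be reported by this check.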

Update Workflow


| Field | Required | Default | Multiple | Notes | Example |
| description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
| properties | Yes | - | No | Configuration object. | |
| tesseractPath | Yes | - | No | Complete path where the tesseract application is installed. | C:\Tesseract-OCR\tesseract |
| processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process. | 600000 |
| imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR. | C:\tempDir |
| maxSize | Yes | 10mb | No | Apply image correction only to images under this size (e.g. 250kb, 5mb, 1gb). | 10mb |
| confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output. | 80.0 |
| outputFormat | Yes | jpg | No | Image format of the output. | png |
| imageType | Yes | bilevel | No | Image color scale of the output. | bilevel |
| dpi | Yes | 300 | No | Image dots per inch of the output. | 300 |
| mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression to get the document MIME type. | /doc/normalizedMimeType |
| pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents. | aspire/pdf |
| imageMimeTypes | Yes | - | Yes | MIME type for image documents. | aspire/drawing |
| startPage | Yes | 0 | No | Page to start processing with OCR. If the value is 0, processing starts from the first page. | 0 |
| endPage | Yes | 20 | No | Last page to process with OCR. | 20 |
| processThreads | Yes | 8 | No | Maximum number of threads used by the application. | 8 |
| processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the process threads. | 30 |
| backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full. | 1000 |
| debug | No | false | No | Enables debug messages. | false |

Example 

{
	"type": "application",
	"appName": "Tesseract Ocr",
	"appType": "tesseract-ocr",
	"config": "com.accenture.aspire:app-ocr-processor",
	"description": "tesseract-ocr",
	"properties": {
		"tesseractPath": "C:\\Tesseract-OCR\\tesseract",
		"processTimeout": 600000,
		"imageDirectory": "C:\\tempDir",
		"maxSize": "10mb",
		"confidenceThreshold": 80,
		"outputFormat": "png",
		"imageType": "bilevel",
		"dpi": 300,
		"mimeTypeXPath": "/doc/normalizedMimeType",
		"pdfMimeTypes": "aspire/pdf",
		"imageMimeTypes": "aspire/drawing",
		"startPage": 0,
		"endPage": 20,
		"processThreads": 8,
		"processQueue": 30,
		"backoffTime": 1000,
		"debug": true
	}
}
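
An update body like the one above can also be assembled programmatically. The Python sketch below starts from the defaults listed in the Update Workflow table and overlays only the values that differ. The names DEFAULTS and build_update_body are hypothetical, and whether Aspire itself fills in omitted defaults is not stated on this page, so the sketch writes every value explicitly.

```python
import json

# Default values taken from the "Default" column of the Update Workflow table.
# Fields without a default (shown as "-") must be supplied by the caller.
DEFAULTS = {
    "processTimeout": 600000,
    "maxSize": "10mb",
    "confidenceThreshold": 80.0,
    "outputFormat": "jpg",
    "imageType": "bilevel",
    "dpi": 300,
    "mimeTypeXPath": "/doc/mimeType",
    "startPage": 0,
    "endPage": 20,
    "processThreads": 8,
    "processQueue": 30,
    "backoffTime": 1000,
    "debug": False,
}


def build_update_body(overrides):
    """Merge caller overrides over the table defaults and wrap them in the body."""
    properties = {**DEFAULTS, **overrides}
    return {
        "type": "application",
        "appName": "Tesseract Ocr",
        "appType": "tesseract-ocr",
        "config": "com.accenture.aspire:app-ocr-processor",
        "description": "tesseract-ocr",
        "properties": properties,
    }


body = build_update_body({
    "tesseractPath": "C:\\Tesseract-OCR\\tesseract",
    "imageDirectory": "C:\\tempDir",
    "pdfMimeTypes": "aspire/pdf",
    "imageMimeTypes": "aspire/drawing",
    "outputFormat": "png",
    "mimeTypeXPath": "/doc/normalizedMimeType",
})
print(json.dumps(body, indent=2))
```

Serializing with json.dumps also takes care of escaping the backslashes in the Windows paths, which is easy to get wrong when the JSON is written by hand.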