Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
properties | Yes | - | No | Configuration object | |
tesseractPath | Yes | - | No | Full path to the installed Tesseract application | C:\Tesseract-OCR\tesseract |
processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
maxSize | Yes | 10mb | No | Apply image correction only to images smaller than this size (e.g., 250kb, 5mb, 1gb) | 10mb |
confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output | 80.0 |
outputFormat | Yes | jpg | No | Image format of the output | png |
imageType | Yes | bilevel | No | Image color scale of the output | bilevel |
dpi | Yes | 300 | No | Image dots per inch of the output | 300 |
mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression used to get the document MIME type | /doc/normalizedMimeType |
pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
imageMimeTypes | Yes | - | Yes | MIME type for image documents | aspire/drawing |
startPage | Yes | 0 | No | Page at which OCR processing starts. A value of 0 starts from the first page. | 0 |
endPage | Yes | 20 | No | Last page to process with OCR | 20 |
processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the number of process threads | 30 |
backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
debug | No | false | No | Enables debug messages. | false |
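The maxSize values in the table are size strings such as 250kb, 5mb, or 1gb. The following Python sketch shows one way such a string could be turned into a byte count and compared against an image size. It is only an illustration: the helper names `parse_size` and `is_under_max_size` are hypothetical, the 1024-based multipliers are an assumption, and treating "under this size" as strictly smaller is also an assumption. The component's own parsing may differ.

```python
import re

# Assumption: binary multipliers (1 kb = 1024 bytes). The component's actual
# interpretation of these units is not stated in the table.
_UNIT_BYTES = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size string such as '250kb' or '10mb' to a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*(kb|mb|gb)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized size string: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_BYTES[unit]


def is_under_max_size(image_bytes: int, max_size: str = "10mb") -> bool:
    """Return True when the image is strictly smaller than max_size (assumed rule)."""
    return image_bytes < parse_size(max_size)
```

For example, `is_under_max_size(5 * 1024 * 1024)` compares a 5 MB image against the default of 10mb from the table.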
NOTE: The following structure does not follow the section order of the component configuration shown on the Tesseract OCR Component - App Bundle page.
{
  "type": "application",
  "appName": "Tesseract Ocr",
  "appType": "tesseract-ocr",
  "config": "com.accenture.aspire:app-ocr-processor",
  "description": "tesseract-ocr",
  "properties": {
    "tesseractPath": "C:\\Tesseract-OCR\\tesseract",
    "processTimeout": 600000,
    "imageDirectory": "C:\\tempDir",
    "maxSize": "10mb",
    "confidenceThreshold": 80,
    "outputFormat": "png",
    "imageType": "bilevel",
    "dpi": 300,
    "mimeTypeXPath": "/doc/normalizedMimeType",
    "pdfMimeTypes": "aspire/pdf",
    "imageMimeTypes": "aspire/drawing",
    "startPage": 0,
    "endPage": 20,
    "processThreads": 8,
    "processQueue": 30,
    "backoffTime": 1000,
    "debug": true
  }
}
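Before submitting a configuration like the one above, it can help to check it against the table. The Python sketch below is a hypothetical pre-flight check, not part of the component: `check_config` and `REQUIRED_PROPERTIES` are names introduced here for illustration. It verifies that every property marked as required is present and that processQueue is at least 3 times processThreads, as the table recommends. The check that endPage is not before startPage is an added assumption.

```python
import json
from typing import List

# Properties the table marks as required inside "properties".
REQUIRED_PROPERTIES = [
    "tesseractPath", "processTimeout", "imageDirectory", "maxSize",
    "confidenceThreshold", "outputFormat", "imageType", "dpi",
    "mimeTypeXPath", "pdfMimeTypes", "imageMimeTypes", "startPage",
    "endPage", "processThreads", "processQueue", "backoffTime",
]


def check_config(app: dict) -> List[str]:
    """Return a list of problems found in an application config (empty if none)."""
    problems = []
    props = app.get("properties", {})

    for name in REQUIRED_PROPERTIES:
        if name not in props:
            problems.append(f"missing required property: {name}")

    # Table note: the queue should be at least 3 times the process threads.
    threads, queue = props.get("processThreads"), props.get("processQueue")
    if isinstance(threads, int) and isinstance(queue, int) and queue < 3 * threads:
        problems.append(f"processQueue ({queue}) is below 3 x processThreads ({threads})")

    # Assumption (not stated in the table): endPage should not precede startPage.
    start, end = props.get("startPage"), props.get("endPage")
    if isinstance(start, int) and isinstance(end, int) and end < start:
        problems.append(f"endPage ({end}) is before startPage ({start})")

    return problems


# The example configuration from this page; the raw string keeps the JSON
# backslash escapes intact.
EXAMPLE = json.loads(r'''
{ "type": "application", "appName": "Tesseract Ocr", "appType": "tesseract-ocr",
  "config": "com.accenture.aspire:app-ocr-processor", "description": "tesseract-ocr",
  "properties": {
    "tesseractPath": "C:\\Tesseract-OCR\\tesseract", "processTimeout": 600000,
    "imageDirectory": "C:\\tempDir", "maxSize": "10mb", "confidenceThreshold": 80,
    "outputFormat": "png", "imageType": "bilevel", "dpi": 300,
    "mimeTypeXPath": "/doc/normalizedMimeType", "pdfMimeTypes": "aspire/pdf",
    "imageMimeTypes": "aspire/drawing", "startPage": 0, "endPage": 20,
    "processThreads": 8, "processQueue": 30, "backoffTime": 1000, "debug": true } }
''')

if __name__ == "__main__":
    for problem in check_config(EXAMPLE):
        print(problem)
```

With the example values, processQueue is 30 and processThreads is 8, so the queue recommendation (at least 24) is satisfied.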