Field | Optional | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
type | No | - | No | The value must be the same as the type of the seed that will use this connector. | "filesystem" |
description | No | - | No | Name of the connector object. | "MyFileSystemConnector" |
artifact | No | - | No | The Maven coordinates of the connector. | "com.accenture.aspire:aspire-filesystem-source" |
properties | No | - | No | Configuration object | |
debug | Yes | false | No | Enables the debug messages | true / false |
wDebug | Yes | false | No | Enables job logging. | true / false |
enableStatistics | Yes | false | No | Enable to gather pipeline job statistics in the debug console | true / false |
infoCacheSize | No | 100 | No | The size of the Source Info cache used by the connector | 200 |
mapCacheSize | No | 100 | No | The number of Storage maps kept in memory per seed | 200 |
setCacheSize | No | 100 | No | The number of Sets kept in memory per seed | 200 |
identityCacheSize | No | 100 | No | The number of identities kept in memory per seed. | 200 |
enableFetcher | No | true | No | Enables document fetching for the seeds that use this connector. | true / false |
enableTextExtract | No | true | No | Enables text extraction. By default, connectors use Apache Tika to extract text from downloaded documents. To apply special text processing to a downloaded document in the workflow, disable text extraction. The downloaded document is then available as a content stream. | true / false |
extractTextMaxSize | No | 20971520 | No | Maximum size of the extracted text, in characters, or "unlimited". Does not apply if the HTML Output option is enabled. | 10000 |
extractTimeout | No | 180000 | No | Maximum time (in ms) to wait for the text extraction. | 18000 |
xmlMaxDepth | No | 2147483647 | No | The maximum nesting depth allowed in a file's inner structure. Can be used to guard against denial-of-service attacks or corrupted files. | 2147483647 |
structuredText | No | false | No | Include formatting in output (in HTML) instead of plain text. | true / false |
tikaConfig | No | - | No | Path to the Apache Tika configuration file. Pass an empty string to use the default configuration. | "/path/to/tikaConfig.xml" / "" |
pdfParserProperties | No | false | No | Enable to change the default PDFBox properties | true / false |
enableAutoSpace | No | true | No | If enabled, the parser should estimate where spaces should be inserted between words. For many PDFs this is necessary as they do not include explicit whitespace characters. | true / false |
suppressDuplicateOverlappingText | No | false | No | If enabled, the parser should try to remove duplicated text over the same region. This is needed for some PDFs that achieve bolding by re-writing the same text in the same area. Note that this can slow down extraction substantially (PDFBOX-956) and sometimes remove characters that were not in fact duplicated (PDFBOX-1155). | true / false |
extractAnnotationText | No | true | No | If enabled, text in annotations will be extracted. | true / false |
sortByPosition | No | false | No | If enabled, sort text tokens by their x/y position before extracting text. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example, if there are 2 columns, the text will be interleaved). | true / false |
extractAcroFormContent | No | true | No | If enabled, extract content from AcroForms at the end of the document | true / false |
extractInlineImages | No | false | No | If enabled, extract inline embedded OBXImages. Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Set to true with caution. | true / false |
extractUniqueInlineImagesOnly | No | true | No | Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. | true / false |
enable-non-text-filter | No | false | No | Enable to filter non text documents. | true / false |
enableFetchForNonText | No | true | No | Enable if the workflow needs to stream the non-text documents. | true / false |
non-text-document | No | false | No | Enable to filter using document extensions (nonTextDocumentsExtensions); disable to filter using the regex file (nonTextDocuments). | true / false |
nonTextDocumentsExtensions | No | jpg,jpeg,gif,png,tif,mp3,mp4,mpg,mpeg,avi,mkv,wav,bmp,swf,war,rar,tgz,dll,exe,class | No | Comma-separated list of non-text document extensions. Used depending on the non-text-document value. | "jpg,jpeg,gif,png" |
nonTextDocuments | No | - | No | Path to a file containing regular expressions that match the non-text documents, one expression per line. Used depending on the non-text-document value. | "config/nonTextDocuments.txt" |
metadataMap | Yes | [ ] | Yes | Settings for mapping extracted fields to a destination field. | |
from | No | - | No | Field to be mapped | "fieldA" |
to | No | - | No | Field where the value will be mapped | "fieldB" |
entity | Yes | "user" | No | Entity (user / group) represented by the static ACL. | "user" / "group" |
access | Yes | "allow" | No | Access (allow / deny) granted by the ACL. | "allow" / "deny" |
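
The fields above can be assembled into a connector definition programmatically. The following is a minimal sketch, not the product's own tooling: `build_connector` is a hypothetical helper, and it assumes that the settings listed after `properties` are nested inside the `properties` object and that `metadataMap` is a list of `from` / `to` pairs. The table does not show the nesting explicitly, so verify the layout against a working configuration before relying on it.

```python
import json

# Top-level fields that the table lists with Optional = No and no default.
REQUIRED = ("type", "description", "artifact", "properties")


def build_connector(type_, description, artifact, properties):
    """Assemble a connector definition as a plain dict (hypothetical helper).

    Assumption: connector settings such as extractTimeout live inside the
    'properties' object; the reference table does not state the nesting.
    """
    connector = {
        "type": type_,
        "description": description,
        "artifact": artifact,
        "properties": dict(properties),
    }
    # Reject definitions where a required top-level value is absent or empty.
    missing = [key for key in REQUIRED if connector.get(key) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return connector


connector = build_connector(
    "filesystem",
    "MyFileSystemConnector",
    "com.accenture.aspire:aspire-filesystem-source",
    {
        "debug": False,
        "enableTextExtract": True,
        "extractTextMaxSize": 10000,
        "extractTimeout": 180000,
        # Hyphenated keys are valid here because they are plain dict keys.
        "enable-non-text-filter": True,
        "non-text-document": True,
        "nonTextDocumentsExtensions": "jpg,jpeg,gif,png",
        "metadataMap": [{"from": "fieldA", "to": "fieldB"}],
    },
)

# Serialize to JSON for inspection or for sending to a configuration endpoint.
print(json.dumps(connector, indent=2))
```

Keeping the settings in a plain dictionary, rather than keyword arguments, lets hyphenated names such as `enable-non-text-filter` sit alongside camelCase ones without special handling.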