Takes the input stream and looks up token patterns from the stream in a given dictionary. The output is the content with matching phrases gathered into single tokens. The matches are also tagged with TokenFlags.WHOLE_TOKEN.

Entity Extractor | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-tokenizer |
subType | extractor |
Inputs | InputStream (set in the contentStream or contentBytes variable) which contains the content to be parsed. |
Outputs | doc.content is set to the input text, marked up in-line with the extracted matches. |

Element | Type | Default | Description |
---|---|---|---|
dictionaryFile | String | none | Required. The location of the dictionary file, which contains the list of entries to be matched against the token stream (see below). Multiple files may be added using the config options shown below. |
dictionaryOffset | Integer | 0 | The number of lines (entries) to skip from the beginning of the dictionary. This setting is mostly used to avoid loading all of a very large dictionary. |
dictionaryEntries | Integer | 0 | The number of lines (entries) to load from the dictionary, starting from dictionaryOffset. Zero means load the entire file. This setting is mostly used to avoid loading all of a very large dictionary. |
extractorName | String | "Extractor" | This name is logged with each hit, so that hits from multiple Extractors can be differentiated. |
normalize | boolean | false | If true, when a hit is found, the term will be changed to the target text, if any. |
debug | boolean | false | When true, additional debugging printouts are activated. |

    <component name="MainDictLookup" subType="extractor" factoryName="aspire-tokenizer">
      <extractorName>Main Terms</extractorName>
      <dictionaryFile>testdata/nse.txt</dictionaryFile>
    </component>
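
The optional elements from the table above can be combined in the same way. The following is a minimal sketch, not a tested configuration: it assumes a large dictionary from which only a slice should be loaded, skipping the first 1000 entries and loading the next 5000, with normalization turned on. The component name, the extractorName value, and the numeric values are hypothetical; the element names, the factory name, and the file path are taken from this page.

```xml
<!-- Hypothetical example: load entries 1001-6000 of the dictionary only -->
<component name="PartialDictLookup" subType="extractor" factoryName="aspire-tokenizer">
  <!-- Logged with each hit, so hits from this extractor can be told apart -->
  <extractorName>Partial Terms</extractorName>
  <dictionaryFile>testdata/nse.txt</dictionaryFile>
  <!-- Skip the first 1000 lines (entries) of the dictionary -->
  <dictionaryOffset>1000</dictionaryOffset>
  <!-- Load 5000 entries starting from the offset; 0 would load the entire file -->
  <dictionaryEntries>5000</dictionaryEntries>
  <!-- Replace each matched term with its target text, if the entry has one -->
  <normalize>true</normalize>
</component>
```

Leaving dictionaryOffset and dictionaryEntries at their default of 0 loads the whole dictionary, which is what the shorter example above does.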