Takes the input stream and looks up token patterns from the stream in a given dictionary. The output is the content with matching phrases gathered into single tokens. The matches are also tagged with TokenFlags.WHOLE_TOKEN.

Entity Extractor
Factory Name: com.searchtechnologies.aspire:aspire-tokenizer
subType: extractor

Inputs: InputStream (set in the contentStream or contentBytes variable) which contains the content to be parsed.
Outputs: Sets doc.content to the input text, marked up in-line with the extracted matches.

Configuration

Element | Type | Default | Description
dictionaryFile | String | none | Required. Identifies the file location of the dictionary file, which contains a list of entries to be matched from the token stream. (see below) Multiple files may be added using config options shown below.
dictionaryOffset | Integer | 0 | The number of lines (entries) to skip from the beginning of the dictionary. This setting is mostly used to avoid loading all of a very large dictionary.
dictionaryEntries | Integer | 0 | The number of lines (entries) to load from the dictionary, starting from dictionaryOffset. Zero means load the entire file. This setting is mostly used to avoid loading all of a very large dictionary.
extractorName | String | "Extractor" | This name is logged with each hit, so that hits from multiple Extractors can be differentiated.
normalize | boolean | false | If true, when a hit is found, the term will be changed to the target text, if any.
debug | boolean | false | When true, a number of printouts are activated.

Example Configurations

<component name="MainDictLookup" subType="extractor" factoryName="aspire-tokenizer">
  <extractorName>Main Terms</extractorName>
  <dictionaryFile>testdata/nse.txt</dictionaryFile>
</component>
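
The optional elements from the configuration table can be combined in the same component. The following is a sketch only: the component name, extractor name, and dictionary path are hypothetical, and the numeric values are chosen purely for illustration. Under the descriptions given in the table, it would skip the first 1000 dictionary entries, load the next 5000, and change each matched term to its target text where one is defined.

```xml
<!-- Hypothetical names and path; values are illustrative only -->
<component name="PartialDictLookup" subType="extractor" factoryName="aspire-tokenizer">
  <!-- Logged with each hit so this extractor can be told apart from others -->
  <extractorName>Partial Terms</extractorName>
  <dictionaryFile>testdata/large-dictionary.txt</dictionaryFile>
  <!-- Skip the first 1000 entries, then load 5000 entries -->
  <dictionaryOffset>1000</dictionaryOffset>
  <dictionaryEntries>5000</dictionaryEntries>
  <!-- Change matched terms to their target text, if any -->
  <normalize>true</normalize>
  <debug>false</debug>
</component>
```

Giving each extractor a distinct extractorName, as here, should make it easier to tell the logged hits of several extractors apart.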