Takes the input stream and looks up token patterns from the stream in a given dictionary. The output is the content with matching phrases gathered into single tokens. The matches are also tagged with TokenFlags.WHOLE_TOKEN.

Entity Extractor
Factory Name	com.searchtechnologies.aspire:aspire-tokenizer
subType	extractor
Inputs	InputStream (set in the contentStream or contentBytes variable) which contains the content to be parsed.
Outputs	Sets doc.content to the input text, marked up in-line with the extracted matches.

Configuration

Element	Type	Default	Description
dictionaryFile	String	none	Required. The location of the dictionary file, which contains the list of entries to match against the token stream (see below). Multiple files may be added using the configuration options shown below.
dictionaryOffset	Integer	0	The number of lines (entries) to skip from the beginning of the dictionary. This setting is mostly used to avoid loading all of a very large dictionary.
dictionaryEntries	Integer	0	The number of lines (entries) to load from the dictionary, starting from dictionaryOffset. Zero means load the entire file. This setting is mostly used to avoid loading all of a very large dictionary.
extractorName	String	"Extractor"	This name is logged with each hit, so that hits from multiple Extractors can be differentiated.
normalize	boolean	false	If true, when a hit is found, the matched term is replaced with its target text, if any.
debug	boolean	false	When true, additional debug printouts are enabled.

Example Configurations

<component name="MainDictLookup" subType="extractor" factoryName="aspire-tokenizer">
  <extractorName>Main Terms</extractorName>
  <dictionaryFile>testdata/nse.txt</dictionaryFile>
</component>
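
The optional elements from the Configuration table can be combined with the basic configuration above. The following is a minimal sketch, not a tested configuration: the component name "WindowedDictLookup", the extractor name, and the numeric values are illustrative assumptions, and the dictionary path is reused from the example above. Under the descriptions in the table, it would skip the first 1000 dictionary entries, load the next 5000, and replace each matched term with its target text where one exists.

```xml
<!-- Hypothetical example: the name and all values below are illustrative only -->
<component name="WindowedDictLookup" subType="extractor" factoryName="aspire-tokenizer">
  <!-- Logged with each hit so this extractor's hits can be told apart -->
  <extractorName>Windowed Terms</extractorName>
  <dictionaryFile>testdata/nse.txt</dictionaryFile>
  <!-- Skip the first 1000 lines (entries) of the dictionary -->
  <dictionaryOffset>1000</dictionaryOffset>
  <!-- Load 5000 entries starting from the offset; 0 would load the entire file -->
  <dictionaryEntries>5000</dictionaryEntries>
  <!-- Change each matched term to its target text, if any -->
  <normalize>true</normalize>
  <!-- Leave debug printouts off -->
  <debug>false</debug>
</component>
```

Restricting the load with dictionaryOffset and dictionaryEntries is mainly useful when the dictionary is very large and only part of it is needed.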
