The Text Tagger stage scans a document for occurrences of words and phrases and adds a count for each matched phrase to the document's <tags> element. The phrases come from one or more "tag files", each containing a list of phrases to match and, optionally, their synonyms. Occurrences of tags and their synonyms are counted and written out as XML. Output is kept separate per tag file, so a document can be tagged against several different lists without the results being mixed. By default the tagger tags the <content> element, but it can tag multiple fields from the AspireObject. Optionally, for a field marked as the document body, the tagger also separately counts occurrences that fall within a configured number of tokens from the start of the text, allowing subsequent stages to bias on proximity to the start of the text.

Tagger
Factory Name	com.searchtechnologies.aspire:aspire-tag-text
subType	default
Inputs	<content> (by default) or any other field from the AspireObject, plus one or more tag files
Outputs	<tags> in the AspireObject

Configuration

Element	Type	Default	Description
output	String	tags	The base output element for the tags in the Aspire document.
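
As an illustration of the output setting, here is a minimal sketch that writes results to a differently named element. The name geoTags is hypothetical, and the tag file path is borrowed from the examples later on this page. The results would presumably then appear under <geoTags> rather than <tags>, but that is an assumption, not confirmed behaviour.

```xml
<component name="tagger" subType="default" factoryName="aspire-tag-text">
  <!-- Hypothetical element name; when omitted, the default is "tags". -->
  <output>geoTags</output>
  <tagLists>
    <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
  </tagLists>
</component>
```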

Configuration of Content to Tag

By default, the tagger processes the <content> tag from the Aspire document. However, it is possible to configure it to tag other fields.

Element	Type	Default	Description
tagFields/tagField/@field	String	None (must be specified)	The field to process.
tagFields/tagField/@isBody	boolean	false	Flag to indicate this field contains the document body.
tagFields/tagField/@startTokens	int	0	For the document body only, the number of tokens from the start of the text to consider separately as being near the document start.

Note:

If no <tagField> tags exist, the tagger defaults to processing the <content> element as the document body.
More than one <tagField> element may be used, to tag more than one field.
If any <tagField> elements are used, you MUST specify ALL fields to tag, including <content> if it is still required.
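
The last point is easy to miss, so here is a small sketch of it, assuming the Aspire document carries a title field. The startTokens value of 50 is purely illustrative.

```xml
<!-- Sketch: tags only the title field. Because a tagField is present,
     content is no longer tagged unless it is listed as well. -->
<tagFields>
  <tagField field="title"/>
  <!-- Uncomment to keep tagging the body too (illustrative startTokens value). -->
  <!-- <tagField field="content" isBody="true" startTokens="50"/> -->
</tagFields>
```

With the second line left commented out, only title would be processed. Restoring it tags both fields and counts matches near the start of the body separately.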

Tag List Configuration

The tagger requires at least one tag list in order to tag documents. Each tag list is a text file containing phrases and their synonyms. Each phrase appears on its own line, and any synonyms appear on the lines that follow it, each preceded by a + symbol.

Element	Type	Default	Description
tagLists/tagList/@id	String	None (must be specified)	An identifier for the tags. This will be output in the Aspire Document against any tags from this file that are identified.
tagLists/tagList/@tagFile	String	None (must be specified)	The path to the file containing the tags. Relative to $ASPIRE_HOME.

Note: More than one <tagList> may be specified.

Example tag file

UK
	+United Kingdom
	+Great Britain
	+Wales
	+Scotland
	+England
	+English
	+British
	+Briton
	+Scottish
	+Welsh

USA
	+United States of America
	+United States
	+America

Tokeniser Configuration

By default, the tagger uses the classes org.apache.lucene.analysis.standard.StandardTokenizer and org.apache.lucene.analysis.LowerCaseFilter to tokenize and lowercase both the document text and the phrases from the tag files. However, these may be overridden if required.

Element	Type	Default	Description
tokenProcessing/tokenizer/@class	String	org.apache.lucene.analysis.standard.StandardTokenizer	String representing the class to use for the tokeniser. Must conform to the parameters/return type of org.apache.lucene.analysis.standard.StandardTokenizer.
tokenProcessing/tokenizer/@jar	String	None (built in)	Jar file the tokenizer class file exists in. Relative to $ASPIRE_HOME.
tokenProcessing/tokenFilter/@class	String	org.apache.lucene.analysis.LowerCaseFilter	String representing the class to use as a filter. Must conform to the parameters/return type of org.apache.lucene.analysis.LowerCaseFilter.
tokenProcessing/tokenFilter/@jar	String	None (built in)	Jar file the token filter class file exists in. Relative to $ASPIRE_HOME.

Note:

More than one token filter may be used.
The tokenizer and tokenFilter elements may contain further attributes. If the configured class implements the AspireInitializer interface, that interface is called with the element's configuration, which allows the class to be initialised with any information it requires.
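
The following sketch shows how these settings might be combined. The jar paths, the com.example class names, and the stopwordFile attribute are all hypothetical and serve only to illustrate the attributes described above. An extra attribute such as stopwordFile would be of use only to a class that implements AspireInitializer and reads it.

```xml
<tokenProcessing>
  <!-- Hypothetical tokenizer loaded from a jar relative to $ASPIRE_HOME. -->
  <tokenizer jar="lib/my-tokenizers.jar" class="com.example.analysis.MyTokenizer"/>
  <!-- Built-in filter, so no jar attribute is needed. -->
  <tokenFilter class="org.apache.lucene.analysis.LowerCaseFilter"/>
  <!-- Hypothetical second filter; stopwordFile is an illustrative extra attribute. -->
  <tokenFilter jar="lib/my-filters.jar" class="com.example.analysis.MyStopFilter" stopwordFile="data/stopwords.txt"/>
</tokenProcessing>
```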

Example Configurations

Simple

 <component name="tagger" subType="default" factoryName="aspire-tag-text">
   <tagLists>
     <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
   </tagLists>
 </component>

Complex

 <component name="tagger" subType="default" factoryName="aspire-tag-text">
   <output>tags</output>
   <tagFields>
     <tagField field="title"/>
     <tagField field="content" isBody="true" startTokens="20"/>
   </tagFields>
   <tagLists>
     <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
     <tagList id="hr" tagFile="data/tagFiles/hr.txt"/>
     <tagList id="sport" tagFile="data/tagFiles/sports.txt"/>
   </tagLists>
   <tokenProcessing>
     <tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizer"/>
     <tokenFilter class="org.apache.lucene.analysis.LowerCaseFilter"/>
     <tokenFilter jar="lib/aspire-lemmatizer.jar" class="org.apache.lucene.analysis.LemmatizerFilter" dictionary="data/dict/gcide_out.xml"/>
   </tokenProcessing>
 </component>

Even More Complex

If the configurations above do not meet your needs, set up a full text tokenization pipeline: configure a Tokenization Manager and add the token filters you need. The tag lists are then handled by Extractor stages.

Example Output

The following is a sample output for the tagger.

Note: This sample does not show synonyms or sub-tags.

 <doc>
   .
   .
   .
   <tags source="textTagger">
       <category name="responsibility">
           <tag body="1" name="administrator"/>
           <tag body="1" name="responsible for" topBody="1"/>
           <tag name="senior" topBody="1"/>
           <tag body="1" name="managing"/>
           <tag body="3" name="management" topBody="2"/>
       </category>
   </tags>
 </doc>
