The Text Tagger stage scans the document for occurrences of words and phrases and adds counts of those phrases to the document's <tags> element. It uses one or more "tag files", each containing a list of phrases to match and, optionally, their synonyms. The stage counts occurrences of the tags and their synonyms and writes the counts as XML. Output is kept separate per tag file, so documents can be tagged against different lists without the results being mixed. The tagger tags the <content> element by default, but can tag multiple fields from the AspireObject. Optionally, for a field marked as the document body, the tagger separately counts occurrences that fall within a configured number of tokens from the start of the text, allowing subsequent stages to bias on proximity to the start of the text.

Tagger

Factory Name: com.searchtechnologies.aspire:aspire-tag-text
subType: default
Inputs: <content> (by default), but potentially any field from the AspireObject, and one or more tag files
Outputs: <tags> in the AspireObject

Configuration



| Element | Type | Default | Description |
|---|---|---|---|
| output | String | tags | The base output element for the tags in the Aspire document. |


Configuration of Content to Tag

By default, the tagger processes the <content> tag from the Aspire document. However, it is possible to configure it to tag other fields.

| Element | Type | Default | Description |
|---|---|---|---|
| tagFields/tagField/@field | String | None (must be specified) | The field to process. |
| tagFields/tagField/@isBody | boolean | false | Flag to indicate this field contains the document body. |
| tagFields/tagField/@startTokens | int | 0 | For the document body only, the number of tokens from the start of the text to consider separately as being near the document start. |

Note:

  • If no <tagField> tags exist, the tagger defaults to processing the <content> element as the document body.
  • More than one <tagField> element may be used, to tag more than one field.
  • If any <tagField> tags are used, you MUST specify ALL fields to tag, including the <content> if required.
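
The effect of isBody and startTokens can be illustrated with a minimal Python sketch. This is not the tagger's implementation: the names tokenize and count_phrase are hypothetical, the regular-expression tokenizer is only a rough stand-in for the default Lucene classes, and the sketch assumes that a match is counted as "near the start" when its first token lies within the first startTokens tokens, with all other matches counted as ordinary body matches.

```python
import re

def tokenize(text):
    # Rough stand-in for tokenizing and lowercasing: lowercase word tokens.
    return re.findall(r"\w+", text.lower())

def count_phrase(text, phrase, start_tokens=0):
    """Return (top_body, body) match counts for one phrase (assumed semantics)."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    top = body = 0
    if not target:
        return top, body
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            # Assumption: a match belongs to the start region if it begins there.
            if i < start_tokens:
                top += 1
            else:
                body += 1
    return top, body

text = "Senior manager responsible for management. Management reports to the board."
print(count_phrase(text, "management", start_tokens=5))  # prints (1, 1)
```

With start_tokens left at 0, every match in this sketch is counted as an ordinary body match, which mirrors the documented default for startTokens.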

Tag List Configuration

The tagger requires at least one tag list in order to tag documents. Each tag list is a text file containing a list of phrases and their synonyms. Each phrase should appear on its own line, with any synonyms on the lines that follow it, each preceded by a + symbol.

| Element | Type | Default | Description |
|---|---|---|---|
| tagLists/tagList/@id | String | None (must be specified) | An identifier for the tags. This will be output in the Aspire Document against any tags from this file that are identified. |
| tagLists/tagList/@tagFile | String | None (must be specified) | The path to the file containing the tags. Relative to $ASPIRE_HOME. |


Note:  More than one <tagList> may be specified.

Example tag file

UK
	+United Kingdom
	+Great Britain
	+Wales
	+Scotland
	+England
	+English
	+British
	+Briton
	+Scottish
	+Welsh

USA
	+United States of America
	+United States
	+America
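
To make the file format concrete, the following Python sketch reads a tag file into a mapping from each phrase to its synonyms. It is an illustration only, not the tagger's own parser: the function name parse_tag_file is hypothetical, and the sketch assumes that blank lines are ignored and that the whitespace before the + symbol is optional.

```python
def parse_tag_file(text):
    """Map each phrase to the list of synonyms declared on the lines below it."""
    tags = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines only separate entries
        if line.startswith("+"):
            if current is None:
                raise ValueError("synonym line appears before any phrase")
            tags[current].append(line[1:].strip())
        else:
            current = line
            tags[current] = []
    return tags

sample = "UK\n\t+United Kingdom\n\t+Great Britain\n\nUSA\n\t+United States\n"
print(parse_tag_file(sample))
```

For the sample above, the result maps UK to United Kingdom and Great Britain, and USA to United States.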

Tokeniser Configuration

By default, the tagger uses the classes org.apache.lucene.analysis.standard.StandardTokenizer and org.apache.lucene.analysis.LowerCaseFilter to tokenize and lowercase both the document text and the phrases from the tag files. However, these may be overridden if required.

| Element | Type | Default | Description |
|---|---|---|---|
| tokenProcessing/tokenizer/@class | String | org.apache.lucene.analysis.standard.StandardTokenizer | String representing the class to use for the tokeniser. Must conform to the parameters/return type of org.apache.lucene.analysis.standard.StandardTokenizer. |
| tokenProcessing/tokenizer/@jar | String | None (built in) | Jar file the tokenizer class file exists in. Relative to $ASPIRE_HOME. |
| tokenProcessing/tokenFilter/@class | String | org.apache.lucene.analysis.LowerCaseFilter | String representing the class to use as a filter. Must conform to the parameters/return type of org.apache.lucene.analysis.LowerCaseFilter. |
| tokenProcessing/tokenFilter/@jar | String | None (built in) | Jar file the token filter class file exists in. Relative to $ASPIRE_HOME. |

Note:

  • More than one token filter may be used.
  • The <tokenizer> and <tokenFilter> elements may contain further attributes. If the configured class implements the AspireInitializer interface, the initializer is called with the config element of the component. This allows the class to be initialised with any required information.
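
The reason the same token processing is applied to both the document text and the tag phrases can be shown with a small Python sketch. It is only an analogy for the Lucene classes named above: tokenizer, lower_case_filter and analyze are hypothetical names, and running the filters in the order they are listed is an assumption.

```python
import re

def tokenizer(text):
    # Rough stand-in for a standard tokenizer: split the text into word tokens.
    return re.findall(r"\w+", text)

def lower_case_filter(tokens):
    # Stand-in for a lowercasing token filter.
    return [t.lower() for t in tokens]

def analyze(text, filters):
    """Tokenize the text, then pass the tokens through each filter in turn."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

filters = [lower_case_filter]
doc_tokens = analyze("Made in GREAT Britain.", filters)
phrase_tokens = analyze("Great Britain", filters)
print(doc_tokens)
print(phrase_tokens)
```

Because both sides pass through the same chain, the phrase tokens line up with a slice of the document tokens regardless of the original casing. Any replacement tokenizer or filter would need to preserve that property for matching to work.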

Example Configurations

Simple

 <component name="tagger" subType="default" factoryName="aspire-tag-text">
   <tagLists>
     <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
   </tagLists>
 </component>

Complex

 <component name="tagger" subType="default" factoryName="aspire-tag-text">
   <output>tags</output>
   <tagFields>
     <tagField field="title"/>
     <tagField field="content" isBody="true" startTokens="20"/>
   </tagFields>
   <tagLists>
     <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
     <tagList id="hr" tagFile="data/tagFiles/hr.txt"/>
     <tagList id="sport" tagFile="data/tagFiles/sports.txt"/>
   </tagLists>
   <tokenProcessing>
     <tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizer"/>
     <tokenFilter class="org.apache.lucene.analysis.LowerCaseFilter"/>
     <tokenFilter jar="lib/aspire-lemmatizer.jar" class="org.apache.lucene.analysis.LemmatizerFilter" dictionary="data/dict/gcide_out.xml"/>
   </tokenProcessing>
 </component>

Even More Complex

If the above kinds of configurations are not enough for your needs, use a full text tokenization pipeline by setting up a Tokenization Manager, and then adding all the token filters desired. The tag lists will be handled by Extractor stages.

Example Output

The following is a sample output for the tagger.

Note: This sample does not show synonyms or sub-tags (an example of those is yet to be added).

 <doc>
   .
   .
   .
   <tags source="textTagger">
       <category name="responsibility">
           <tag body="1" name="administrator"/>
           <tag body="1" name="responsible for" topBody="1"/>
           <tag name="senior" topBody="1"/>
           <tag body="1" name="managing"/>
           <tag body="3" name="management" topBody="2"/>
       </category>
   </tags>
 </doc>