For a list of all available token processors, see Token Processors.

For information on programming new token processors, see Creating Token Processors.

How It's Put Together

The basic operation of the Tokenization Manager component is shown in the following diagram:

[Figure: Relationship between the TokenizationManager and AspireAnalyzer objects]

Comments:

  • The TokenizationManager component produces AspireAnalyzer objects.

...

Change since earlier 0.4-SNAPSHOT versions: Previously, you passed the AspireObject (i.e., the Job's object variable) to newAspireAnalyzer(). This is no longer the case: you now always pass the Job object itself, in all circumstances.

Configuration

| Element | Type | Default | Description |
|---|---|---|---|
| processors | parent tag | none | A parent tag which contains a list of token processors. Note that the first processor listed within <processors> must be a tokenizer, and all of the others are token filters or token statistics aggregators. |
| processors/processor | tag with attributes | none | Specifies each individual token processor in the tokenization pipeline. Note that the order of the <processor> tags is important, since the tokens will be processed in the order specified. Each token processor will be called on each token in turn. See below for a list of token processors currently defined. |
| processor/@class | string | none | Many "core" component processors are provided with the Tokenization Manager itself. All of these core processors are specified using the Java class name of the processor, given in the @class attribute. See Token Processors to determine which processors are "core" and what Java class name should be used for each. |
| processor/@componentRef | string | none | Component processors may also be specified as Aspire components. This is the method used for processors which are created outside the Tokenization Manager (the non-core processors) and is often used if customers require custom token processors for special needs. See below for how these components are configured in Aspire. Once the component is configured, use the @componentRef attribute to specify the Aspire component name (may be relative or absolute) of the processor. |
| processor/@scope | string | none | Can be "analyzer", "document", "parent", or "grandparent". Specifies the scope for variables which are created by this token processor. See the discussion of "scope" above. |
| processor/@useCases (note: plural 'cases') | string | none | A comma-separated list of use cases for which the tokenization processor will be enabled. Each "use case" is a name which can be determined by the configuration. The only use case name which has significance is "default", which is the use case enabled when no use case is specified to the newAspireAnalyzer() function. See more information about use cases below. |
| processor/* | parent tag | none | Many tokenization processors have additional configuration parameters which can be specified in nested tags within the <processor> tag. See Token Processors for more information about what configuration options are available for each token processor. Note that configuration cannot be specified when referencing a token processor by component name with the @componentRef attribute. |
| tagsToProcess | parent tag | none | (Only for use as a pipeline stage) A parent tag which contains a list of nested <tag> elements, which identifies the list of AspireObject XML tags to be sent through the token processing pipeline when this component is used as a pipeline stage. |
| tagsToProcess/tag/@name | string | none | Identifies the XML tag from the AspireObject whose text will be sent through the token processors. Multiple tags can be specified. They are processed in order. |
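
The elements in the table above might be combined roughly as in the fragment below. This is only a sketch: the Java class names, the component path, the "index" use-case name, and the tag names "title" and "content" are hypothetical placeholders, not identifiers taken from the Tokenization Manager. Consult Token Processors for the real class names and their nested options.

```xml
<!-- Sketch only: every class name, component path, use-case name,
     and tag name below is a hypothetical placeholder. -->
<processors>
  <!-- The first processor listed must be a tokenizer. -->
  <processor class="com.example.tokenizers.ExampleTokenizer" />

  <!-- Later processors are token filters or statistics aggregators;
       they run on each token in the order listed here. -->
  <processor class="com.example.filters.ExampleLowerCaseFilter"
             useCases="default,index" />

  <!-- @scope controls where this processor's variables are created. -->
  <processor class="com.example.stats.ExampleTokenCounter"
             scope="document" />

  <!-- A non-core processor referenced by Aspire component name.
       Nested configuration cannot be supplied with @componentRef. -->
  <processor componentRef="/common/ExampleCustomFilter" />
</processors>

<!-- Only used when the component runs as a pipeline stage. -->
<tagsToProcess>
  <tag name="title" />
  <tag name="content" />
</tagsToProcess>
```

In this sketch the second processor runs only for the "default" and (hypothetical) "index" use cases. The other processors specify no @useCases attribute.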

Specifying Tokenization Processors

...