For a list of all available token processors, see Token Processors.

For information on programming new token processors, see Creating Token Processors.

How It's Put Together

The basic operation of the Tokenization Manager component is shown in the diagram to the right:

[Diagram: basic operation of the Tokenization Manager component]

Comments:

The TokenizationManager component produces AspireAnalyzer objects.

...

Change since earlier 0.4-SNAPSHOT versions: Previously, you passed the AspireObject (i.e. the Job's object variable) to newAspireAnalyzer(). This is no longer the case: you now pass the Job object itself, in all circumstances.

Configuration

Element	Type	Default	Description
processors	parent tag	none	A parent tag which contains a list of token processors. The first processor listed within <processors> must be a tokenizer; all of the others must be token filters or token statistics aggregators.
processors/processor	tag with attributes	none	Specifies each individual token processor in the tokenization pipeline. The order of the <processor> tags is important, since tokens are processed in the order specified: each token processor is called on each token in turn. See below for a list of token processors currently defined.
processor/@class	string	none	Many "core" processors are provided with the Tokenization Manager itself. A core processor is specified by giving its Java class name in the @class attribute. See Token Processors to determine which processors are "core" and what Java class name should be used for each.
processor/@componentRef	string	none	Component processors may also be specified as Aspire components. This is the method used for processors which are created outside the Tokenization Manager (the non-core processors) and is often used if customers require custom token processors for special needs. See below for how these components are configured in Aspire. Once the component is configured, use the @componentRef attribute to specify the Aspire component name (may be relative or absolute) of the processor.
processor/@scope	string	none	Can be "analyzer", "document", "parent", or "grandparent". Specifies the scope for variables which are created by this token processor. See the discussion of "scope" above.
processor/@useCases (note: plural 'cases')	string	none	This is a comma-separated list of use cases for which the tokenization processor will be enabled. Each "use case" is a name which can be determined by the configuration. The only use case name which has significance is "default" which is the use case which is enabled when no use case is specified to the newAspireAnalyzer() function. See more information about use cases below.
processor/*	parent tag	none	Many tokenization processors have additional configuration parameters which can be specified in nested tags within the <processor> tag. See Token Processors for more information about what configuration options are available for each token processor. Note that configuration cannot be specified when referencing a token processor by component name with the @componentRef attribute.
tagsToProcess	parent tag	none	(Only for use as a pipeline stage) A parent tag which contains a list of nested <tag> elements. These identify the AspireObject XML tags to be sent through the token processing pipeline when this component is used as a pipeline stage.
tagsToProcess/tag/@name	string	none	Identifies the XML tag from the AspireObject whose text will be sent through the token processors. Multiple tags can be specified. They are processed in order.
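
As a rough illustration of how the elements in the table fit together, a configuration fragment might look like the sketch below. The class names, the component path, the nested <maxLength> parameter, the "indexing" use case name and the tag names are all hypothetical placeholders, not actual Tokenization Manager processors or settings; see Token Processors for the real class names and options. The enclosing Aspire component element is omitted.

```xml
<!-- Sketch only: every class name, component path and nested parameter is a placeholder. -->
<processors>
  <!-- The first processor must be a tokenizer (hypothetical class name). -->
  <processor class="com.example.tokens.MyTokenizer"/>

  <!-- A token filter enabled for two use cases; <maxLength> is a hypothetical nested option. -->
  <processor class="com.example.tokens.MyLengthFilter" useCases="default,indexing">
    <maxLength>64</maxLength>
  </processor>

  <!-- A non-core processor referenced by Aspire component name (hypothetical path).
       Nested configuration cannot be given here. -->
  <processor componentRef="/common/MyCustomProcessor" scope="document"/>
</processors>

<!-- Only relevant when the component is used as a pipeline stage; tag names are examples. -->
<tagsToProcess>
  <tag name="title"/>
  <tag name="content"/>
</tagsToProcess>
```

In this sketch, tokens would pass through the tokenizer, then the length filter, then the referenced component, in that order, and the text of the title tag would be processed before that of the content tag.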

Specifying Tokenization Processors

...
