For a list of all available token processors, see Token Processors.

For information on programming new token processors, see Creating Token Processors.

Also see the Use Cases below.

How It's Put Together

...

The basic operation of the Tokenization Manager component is shown in the diagram to the right:

 

Relationship between the TokenizationManager and AspireAnalyzer objects

Comments: 

  • The TokenizationManager component produces AspireAnalyzer objects.
  • AspireAnalyzer objects take in string data, tokenize the data, and can then do further processing on the tokens.

There are two types of processing:

    • Token manipulation - filtering, modifying, or splitting tokens
    • Computing summary information - for example, counting tokens, creating token histograms

Inputs to the token processing pipeline

Text can be sent into the tokenizer in two ways:

  1. A Groovy stage can call processAll() (or a similar process method) on the AspireAnalyzer to process specific text.
  2. The TokenizationManager can be a pipeline stage, in which case selected tags from the job's XML are automatically sent into the analyzer.
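
As a rough illustration of the second option, the fragment below sketches the <tagsToProcess> element described in the Configuration section. The tag names "title" and "content" are assumptions chosen for illustration and are not taken from any real job. The enclosing component declaration is omitted because its attributes are not given here.

```xml
<!-- Sketch only: "title" and "content" are hypothetical AspireObject tag names -->
<tagsToProcess>
  <!-- Multiple tags may be listed; they are processed in the order given -->
  <tag name="title"/>
  <tag name="content"/>
</tagsToProcess>
```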

Outputs from the token processing pipeline

Output can be retrieved from the tokenization pipeline in two ways:

  1. A Groovy stage can get a token stream and then extract tokens from the pipeline one at a time.
    • (some work on this is still TBD)
  2. The results of some token processors, specifically those that create summary statistics, can be automatically stored in variables: either a job's variables or a job's parent's variables (see below).

Aspire Analyzers Are Different than Lucene Analyzers

Note that an "AspireAnalyzer" works differently than analyzers in Lucene. In Lucene, analyzers are factories for TokenStreams, from which tokens can be retrieved.

...

Change since earlier 0.4-SNAPSHOT versions: Previously, you passed the AspireObject (i.e. the Job's object variable) to newAspireAnalyzer(). This is no longer the case. Now, you always pass the job object, in all circumstances.

Configuration

Element | Type | Default | Description
--- | --- | --- | ---
processors | parent tag | none | A parent tag which contains a list of token processors. Note that the first processor listed within <processors> must be a tokenizer, and all of the others are token filters or token statistics aggregators.
processors/processor | tag with attributes | none | Specifies each individual token processor in the tokenization pipeline. Note that the order of the <processor> tags is important - since the tokens will be processed in the order specified. Each token processor will be called on each token in turn. See below for a list of token processors currently defined.
processor/@class | string | none | Many "core" component processors are provided with the Tokenization Manager itself. All of these core processors are specified using the Java class name of the processor specified with the @class attribute. See Token Processors to determine which processors are "core" and what Java class name should be used for each.
processor/@componentRef | string | none | Component processors may also be specified as Aspire components. This is the method used for processors which are created outside the Tokenization Manager (the non-core processors) and is often used if customers require custom token processors for special needs. See below for how these components are configured in Aspire. Once the component is configured, use the @componentRef attribute to specify the Aspire component name (may be relative or absolute) of the processor.
processor/@scope | string | none | Can be "analyzer", "document", "parent", or "grandparent". Specifies the scope for variables which are created by this token processor. See the discussion of "scope" above.
processor/@useCases (note: plural 'cases') | string | none | This is a comma-separated list of use cases for which the tokenization processor will be enabled. Each "use case" is a name which can be determined by the configuration. The only use case name which has significance is "default" which is the use case which is enabled when no use case is specified to the newAspireAnalyzer() function. See more information about use cases below.
processor/* | parent tag | none | Many tokenization processors have additional configuration parameters which can be specified in nested tags within the <processor> tag. See Token Processors for more information about what configuration options are available for each token processor. Note that configuration can not be specified when referencing a token processor by component name with the @componentRef attribute.
tagsToProcess | parent tag | none | (Only for use as a pipeline stage) A parent tag which contains a list of nested <tag> elements, which identifies the list of AspireObject XML tags to be sent through the token processing pipeline when this component is used as a pipeline stage.
tagsToProcess/tag/@name | string | none | Identifies the XML tag from the AspireObject whose text will be sent through the token processors. Multiple tags can be specified. They are processed in order.
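
To make the ordering rule concrete, here is a minimal sketch of a <processors> block built only from the elements and attributes listed above. Every class name is a hypothetical placeholder (the real class names are listed under Token Processors), and the enclosing component declaration is omitted.

```xml
<!-- Sketch only: all com.example class names below are hypothetical placeholders -->
<processors>
  <!-- The first processor must be a tokenizer -->
  <processor class="com.example.tokens.ExampleTokenizer"/>
  <!-- Later processors see each token in the order they are listed here -->
  <processor class="com.example.tokens.ExampleLowerCaseFilter"/>
  <!-- A statistics aggregator; scope selects where its variables are stored -->
  <processor class="com.example.tokens.ExampleTokenCounter" scope="document">
    <!-- Processor-specific nested configuration tags, if any, would go here -->
  </processor>
</processors>
```

Because each processor is called on each token in turn, swapping the filter and the counter in this sketch would change what the counter sees.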

Specifying Tokenization Processors

...

If the componentRef is an absolute path, such as "/common/CustomTokenFilter", this allows you to share Aspire token filter factories or Aspire tokenizer factories across multiple instances of the Tokenization Manager.
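
A hedged sketch of how such references might look inside <processors> follows. The absolute path "/common/CustomTokenFilter" comes from the text above. The tokenizer class name and the relative component name "LocalTokenFilter" are assumptions made for illustration.

```xml
<!-- Sketch only: the class name and "LocalTokenFilter" are hypothetical -->
<processors>
  <processor class="com.example.tokens.ExampleTokenizer"/>
  <!-- Absolute path: can be shared by several Tokenization Manager instances -->
  <processor componentRef="/common/CustomTokenFilter"/>
  <!-- Relative component name -->
  <processor componentRef="LocalTokenFilter"/>
</processors>
```

Neither componentRef entry carries nested configuration tags, since configuration cannot be specified on a processor referenced by component name.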

Use Cases

...

Complex text analytics applications often have many different but mostly similar tokenization pipelines. Further, these pipelines can be very long and contain dozens of stages.
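
As one possible illustration, the sketch below uses the @useCases attribute from the Configuration section so that two similar pipelines share a single <processors> list. The use case name "index" and all class names are hypothetical. Each processor lists its use cases explicitly because the behavior of a processor with no @useCases attribute is not covered here.

```xml
<!-- Sketch only: "index" and all com.example class names are hypothetical -->
<processors>
  <!-- Shared by the default use case and the "index" use case -->
  <processor class="com.example.tokens.ExampleTokenizer" useCases="default,index"/>
  <processor class="com.example.tokens.ExampleLowerCaseFilter" useCases="default,index"/>
  <!-- Enabled only when the "index" use case is requested -->
  <processor class="com.example.tokens.ExampleTokenHistogram" useCases="index" scope="document"/>
</processors>
```

Under these assumptions, an analyzer created without naming a use case would run only the first two processors, since "default" is the use case enabled when none is specified to newAspireAnalyzer().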

...