For a list of all available token processors, see Token Processors.

For information on programming new token processors, see Creating Token Processors.

Also see the Use Cases below.

How It's Put Together

...

The basic operation of the Tokenization Manager component is shown in the diagram to the right:

 

Relationship between the TokenizationManager and AspireAnalyzer objects

Comments: 

  • The TokenizationManager component produces AspireAnalyzer objects.
  • AspireAnalyzer objects take in string data, tokenize the data, and can then do further processing on the tokens.

There are two types of processing:

    • Token manipulation - filtering, modifying, or splitting tokens
    • Computing summary information - for example, counting tokens, creating token histograms
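The two kinds of processing can be illustrated with a minimal plain-Java sketch. It does not use the Aspire API: the class `TokenProcessingSketch` and its methods `manipulate` and `histogram` are hypothetical names chosen only for illustration, and the filtering rule (drop one-character tokens) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Aspire.
class TokenProcessingSketch {

    // Token manipulation: lowercase every token and drop one-character tokens.
    static List<String> manipulate(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lowered = token.toLowerCase(Locale.ROOT);
            if (lowered.length() > 1) {
                out.add(lowered);
            }
        }
        return out;
    }

    // Summary computation: count how often each token occurs (a token histogram).
    static Map<String, Integer> histogram(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("The", "quick", "fox", "a", "the", "Fox");
        List<String> cleaned = manipulate(raw);
        System.out.println(cleaned);            // tokens after manipulation
        System.out.println(histogram(cleaned)); // summary computed over those tokens
    }
}
```

The manipulation step returns a new token list, while the summary step leaves the tokens unchanged and only reports on them.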

Inputs to the token processing pipeline

Text can be sent into the tokenizer in two ways:

  1. A Groovy stage can call processAll() (or a similar process method) on the AspireAnalyzer to process specific text.
  2. The TokenizationManager can be a pipeline stage, in which case selected tags from the job's XML are automatically sent into the analyzer.
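As a rough picture of the second route, the sketch below pulls the text of selected tags out of an XML document using only the JDK's DOM parser. It is an assumption-laden illustration, not how the TokenizationManager is implemented: the class `TagSelectionSketch`, the method `selectText`, and the sample XML and tag names are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this is not Aspire code.
class TagSelectionSketch {

    // Collect the text content of every element whose tag name is selected.
    // Results are grouped by tag name, in the order the names are given.
    static List<String> selectText(String xml, List<String> tagNames) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> texts = new ArrayList<>();
            for (String tagName : tagNames) {
                NodeList nodes = doc.getElementsByTagName(tagName);
                for (int i = 0; i < nodes.getLength(); i++) {
                    texts.add(nodes.item(i).getTextContent());
                }
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up job XML; real job documents will look different.
        String xml = "<doc><title>Hello world</title><id>42</id>"
                + "<content>Some body text</content></doc>";
        // Only the selected tags would be handed on to the analyzer.
        for (String text : selectText(xml, Arrays.asList("title", "content"))) {
            System.out.println(text);
        }
    }
}
```

In this sketch the `id` element is ignored because it is not in the selected list; each selected string would then be passed to the analyzer for tokenization.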

Outputs from the token processing pipeline

Output can be retrieved from the tokenization pipeline in two ways:

  1. A Groovy stage can get a token stream and then extract tokens from the pipeline one at a time
    • (some work on this is still TBD)
  2. The results of some token processors, specifically those that create summary statistics, can be automatically stored in variables: either a job's variables or a job's parent's variables (see below).
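The two output routes can be contrasted with a small plain-Java sketch. It assumes a whitespace tokenizer and models the variables as an ordinary map; the class `OutputRoutesSketch`, the methods `tokenStream` and `storeTokenCount`, and the variable name `tokenCount` are hypothetical and are not part of the Aspire API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Aspire.
class OutputRoutesSketch {

    // Route 1: expose tokens as a stream that the caller pulls one at a time.
    // Assumption: tokens are separated by whitespace.
    static Iterator<String> tokenStream(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyIterator();
        }
        return Arrays.asList(trimmed.split("\\s+")).iterator();
    }

    // Route 2: compute a summary statistic and store it in a variables map.
    // The map stands in for either a job's variables or its parent's variables.
    static void storeTokenCount(String text, Map<String, Object> variables) {
        int count = 0;
        Iterator<String> tokens = tokenStream(text);
        while (tokens.hasNext()) {
            tokens.next();
            count++;
        }
        variables.put("tokenCount", count);
    }

    public static void main(String[] args) {
        // Route 1: the caller consumes each token itself.
        Iterator<String> tokens = tokenStream("one two  three");
        while (tokens.hasNext()) {
            System.out.println(tokens.next());
        }

        // Route 2: the caller only reads the stored summary afterwards.
        Map<String, Object> jobVariables = new HashMap<>();
        storeTokenCount("one two  three", jobVariables);
        System.out.println(jobVariables.get("tokenCount"));
    }
}
```

With the first route the caller sees every token; with the second it never touches the tokens and reads only the stored result.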

Aspire Analyzers Are Different than Lucene Analyzers

Note that an "AspireAnalyzer" works differently than analyzers in Lucene. In Lucene, analyzers are factories for TokenStreams, from which tokens can be retrieved.

...

If the componentRef is an absolute path, such as "/common/CustomTokenFilter", the referenced Aspire token filter factory or Aspire tokenizer factory can be shared across multiple instances of the Tokenization Manager.

Use Cases

...

Complex text analytics applications often have many different but mostly similar tokenization pipelines. Further, these pipelines can be very long and contain dozens of stages.

...