For a list of all available token processors, see Token Processors.
For information on programming new token processors, see Creating Token Processors.
Also see the Use Cases below.
...
The basic operation of the Tokenization Manager component is shown in the diagram to the right:
Comments:
There are two types of processing:
Text can be sent into the tokenizer in two ways:
Output can be retrieved from the tokenization pipeline in two ways:
Note that an "AspireAnalyzer" works differently from analyzers in Lucene. In Lucene, analyzers are factories for TokenStreams, from which tokens can be retrieved.
...
Change since earlier 0.4-SNAPSHOT versions: previously, you passed the AspireObject (i.e. the Job's object variable) to newAspireAnalyzer(). This is no longer the case; you now always pass the job object, in all circumstances.
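A minimal sketch of the call as it might now appear in an Aspire scripting stage (the surrounding context is illustrative only; the only call taken from this document is newAspireAnalyzer()):

```groovy
// Previously, the Job's AspireObject was passed to newAspireAnalyzer().
// Now, always pass the job object itself:
def analyzer = newAspireAnalyzer(job)
```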
Element | Type | Default | Description |
---|---|---|---|
processors | parent tag | none | A parent tag which contains a list of token processors. Note that the first processor listed within <processors> must be a tokenizer, and all of the others are token filters or token statistics aggregators. |
processors/processor | tag with attributes | none | Specifies each individual token processor in the tokenization pipeline. Note that the order of the <processor> tags is important, since tokens are processed in the order specified; each token processor is called on each token in turn. See below for a list of the token processors currently defined. |
processor/@class | string | none | Many "core" component processors are provided with the Tokenization Manager itself. All of these core processors are specified by giving the Java class name of the processor in the @class attribute. See Token Processors to determine which processors are "core" and what Java class name should be used for each. |
processor/@componentRef | string | none | Component processors may also be specified as Aspire components. This is the method used for processors which are created outside the Tokenization Manager (the non-core processors) and is often used if customers require custom token processors for special needs. See below for how these components are configured in Aspire. Once the component is configured, use the @componentRef attribute to specify the Aspire component name (may be relative or absolute) of the processor. |
processor/@scope | string | none | Can be "analyzer", "document", "parent", or "grandparent". Specifies the scope for variables which are created by this token processor. See the discussion of "scope" above. |
processor/@useCases (note: plural 'cases') | string | none | A comma-separated list of use cases for which the token processor will be enabled. Each use case is a name defined by the configuration. The only use case name with special significance is "default", which is the use case enabled when no use case is specified to the newAspireAnalyzer() function. See more information about use cases below. |
processor/* | parent tag | none | Many token processors have additional configuration parameters which can be specified in nested tags within the <processor> tag. See Token Processors for more information about what configuration options are available for each token processor. <p/> Note that configuration cannot be specified when referencing a token processor by component name with the @componentRef attribute. |
tagsToProcess | parent tag | none | (Only for use as a pipeline stage) A parent tag which contains a list of nested <tag> elements, which identifies the list of AspireObject XML tags to be sent through the token processing pipeline when this component is used as a pipeline stage. |
tagsToProcess/tag/@name | string | none | Identifies the XML tag from the AspireObject whose text will be sent through the token processors. Multiple tags can be specified; they are processed in order. |
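Putting the elements above together, a configuration along these lines might be used (the processor class names here are hypothetical, shown only to illustrate the structure):

```xml
<processors>
  <!-- The first processor listed must be a tokenizer -->
  <processor class="com.example.aspire.WhitespaceTokenizer"/>
  <!-- Subsequent processors are token filters or token statistics
       aggregators, applied in the order listed -->
  <processor class="com.example.aspire.LowerCaseFilter"/>
  <!-- A non-core processor, referenced as an Aspire component;
       note that nested configuration cannot be used with @componentRef -->
  <processor componentRef="CustomTokenFilter" scope="document"/>
</processors>
```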
...
If the componentRef is an absolute path, such as "/common/CustomTokenFilter", this allows you to share Aspire token filter factories or Aspire tokenizer factories across multiple instances of the Tokenization Manager.
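For example, multiple Tokenization Manager instances could each reference the same shared filter by its absolute path (the tokenizer class name here is hypothetical):

```xml
<!-- In each Tokenization Manager's configuration -->
<processors>
  <processor class="com.example.aspire.WhitespaceTokenizer"/>
  <!-- Absolute path: the same component is shared by every
       Tokenization Manager that references it -->
  <processor componentRef="/common/CustomTokenFilter"/>
</processors>
```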
...
Complex text analytics applications will often have many different but mostly similar tokenization pipelines. Further, these pipelines can be very long and can contain dozens of stages.
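Rather than maintaining many near-duplicate pipelines, a single pipeline can tag individual processors with @useCases, so that each use case enables only a subset of the stages. A sketch (processor class names are hypothetical):

```xml
<processors>
  <!-- Enabled for both the default use case and a "query" use case -->
  <processor class="com.example.aspire.WhitespaceTokenizer" useCases="default,query"/>
  <processor class="com.example.aspire.LowerCaseFilter" useCases="default,query"/>
  <!-- Enabled only when newAspireAnalyzer() is called with the "query" use case -->
  <processor class="com.example.aspire.SynonymFilter" useCases="query"/>
</processors>
```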
...