The Language Processing Toolkit processes text through a pipeline of processing stages. A typical pipeline consists of the following sets of stages:

  • Text Block Reader - Reads a text stream (for example, text from a file, a tweet, or from the user)
    • There can only be one and it must be first.
  • Tokenization - Reads text blocks and divides them into individual tokens
    • Typically there is only one, and it usually comes second.
  • Splitters - Divide up tokens, usually based on punctuation
    • This is not part of the tokenizer because sometimes you want to process tokens with their intervening punctuation intact (for example, floating point numbers, or words with dashes like "e-mail").
  • Recognizers - Recognize and flag various types of tokens, for example numbers or tokens that consist entirely of punctuation.
  • Normalizers - Perform character normalization on tokens, typically adding lower-case versions of any word with upper-case characters.
  • Taggers - Add semantic tags for things like entities and sentence interpretations.
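
To make the distinction between tokenization and splitting concrete, here is a minimal sketch in plain Java. It is not Saga code: the class name PipelineSketch and both helper methods are hypothetical, and the real stages may behave differently. It only illustrates why a whitespace token such as "e-mail" or "3.14" can be worth keeping whole in addition to its punctuation-split parts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these helpers are NOT part of Saga.
final class PipelineSketch {

    // Tokenization: divide a text block into tokens on runs of whitespace.
    static List<String> tokenize(String textBlock) {
        List<String> tokens = new ArrayList<>();
        for (String piece : textBlock.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Splitting: break one token into runs of letters/digits,
    // emitting each punctuation character as its own part.
    static List<String> splitOnPunctuation(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                run.append(c);
            } else {
                if (run.length() > 0) {
                    parts.add(run.toString());
                    run.setLength(0);
                }
                parts.add(String.valueOf(c));
            }
        }
        if (run.length() > 0) {
            parts.add(run.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Each whitespace token is printed next to its split form.
        for (String token : tokenize("Send e-mail about 3.14")) {
            System.out.println(token + " -> " + splitOnPunctuation(token));
        }
    }
}
```

In this sketch the tokenizer leaves "e-mail" and "3.14" intact, so a later stage could still recognize them as a single word or number, while the splitter exposes the individual parts.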

The pipeline can be specified in JSON, which can be stored in a resource (see Resources). A sample is shown below:

Sample JSON Pipeline Configuration
{
  "reader": {
    "type": "SimpleReader",
    "splitRegex": "\r\n"
  },
  "stages": [
    { "type": "WhitespaceTokenizer" },
    { "type": "CharacterSplitter" },
    { "type": "com.accenture.saga.engine.stages.CaseAnalysisStage" },
    { 
      "type": "DictionaryTagger",
      "dictionary": "resources-provider:dictionary",
      "required":["TOKEN", "ALL_LOWER_CASE"]
    }
  ]
}

Structure

There are two sections to the pipeline configuration:

  • "reader"
    • Contains the configuration for the text reader (the first stage which reads the text stream and converts it into text blocks to be processed).
  • "stages"
    • Contains a list of pipeline stages, each of which is a JSON object with a "type" field plus any stage-specific settings.

Stages

Stage configurations are documented with each pipeline stage.

The "type"

The "type" field specifies the Java class that implements the pipeline stage. This can be:

  • A fully qualified Java package and class name.
    • For example: com.accenture.saga.engine.stages.CaseAnalysisStage
  • A simple stage name without "Stage" at the end
    • When this occurs, Saga will automatically look in the "com.accenture.saga.engine.stages" package for a class with the same name and with "Stage" appended to the end.
    • For example: "DictionaryTagger" will automatically resolve to "com.accenture.saga.engine.stages.DictionaryTaggerStage"
  • Any other class name (without the Java package)
    • Saga will automatically look for the class in the "com.accenture.saga.engine.stages" package.
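
The lookup rules above can be sketched roughly as follows. This is an illustration under assumptions, not Saga's actual implementation: the class StageTypeResolver, its methods, the use of a "." to detect a fully qualified name, and the order in which candidates are tried are all hypothetical. Only the package name and the "Stage" suffix rule come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "type" lookup rules; NOT Saga's real resolver.
final class StageTypeResolver {

    // Default package named in the documentation above.
    static final String DEFAULT_PACKAGE = "com.accenture.saga.engine.stages";

    // Returns the class names to try, in order, for a given "type" value.
    static List<String> candidateClassNames(String type) {
        List<String> candidates = new ArrayList<>();
        if (type.contains(".")) {
            // Assumption: a dot means the name is already fully qualified.
            candidates.add(type);
            return candidates;
        }
        if (!type.endsWith("Stage")) {
            // Simple stage name: try it with "Stage" appended first.
            candidates.add(DEFAULT_PACKAGE + "." + type + "Stage");
        }
        // Any other class name: look for it as-is in the default package.
        candidates.add(DEFAULT_PACKAGE + "." + type);
        return candidates;
    }

    // Loads the first candidate class found on the classpath.
    static Class<?> resolve(String type) throws ClassNotFoundException {
        for (String name : candidateClassNames(type)) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                // Not found under this name; fall through to the next candidate.
            }
        }
        throw new ClassNotFoundException("No pipeline stage class found for type: " + type);
    }

    public static void main(String[] args) {
        System.out.println(candidateClassNames("DictionaryTagger"));
        System.out.println(candidateClassNames("com.accenture.saga.engine.stages.CaseAnalysisStage"));
    }
}
```

Separating the candidate list from the class loading keeps the naming rules easy to inspect without needing the stage classes on the classpath.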