The overall structure of a Saga program is shown in the diagram below:

[Diagram: overall structure of a Saga program]

A Saga engine is a pipeline of text processing stages:

  1. The first stage in the pipeline is a "reader".
    • This reads raw text from a text stream and returns it as text blocks to be processed by the stages.
  2. Then there is a list of pipeline stages.
  3. The result is the final interpretation graph of text blocks, tokens, and semantic tags.
  4. The order of the stages matters: different orders will produce different results.
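The reader-then-stages flow described above can be sketched in a few lines of Python. This is only an illustration, not Saga's actual API: the names `reader`, `run_pipeline`, `add_lowercase`, `tag_known_names`, the `KNOWN_NAMES` dictionary, and the dict-based token shape are all assumptions made for this example, and flat token lists stand in for the real interpretation graph. The point it tries to show is item 4: running the same two stages in a different order gives a different result.

```python
import io
from typing import Callable, Dict, Iterator, List, TextIO

Token = Dict[str, object]                     # hypothetical token shape
Stage = Callable[[List[Token]], List[Token]]  # a stage maps tokens to tokens

KNOWN_NAMES = {"abe", "lincoln"}              # toy dictionary, lowercase only


def reader(stream: TextIO) -> Iterator[str]:
    """Read raw text from a stream and yield one text block per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield line


def tokenize(block: str) -> List[Token]:
    """Split a text block on whitespace into tokens with no tags yet."""
    return [{"text": w, "variants": {w}, "tags": set()} for w in block.split()]


def add_lowercase(tokens: List[Token]) -> List[Token]:
    """Stage: add a lowercase variant to every token (nothing is removed)."""
    for t in tokens:
        t["variants"].add(t["text"].lower())
    return tokens


def tag_known_names(tokens: List[Token]) -> List[Token]:
    """Stage: tag tokens that have a variant in the dictionary."""
    for t in tokens:
        if t["variants"] & KNOWN_NAMES:
            t["tags"].add("person-name")
    return tokens


def run_pipeline(stream: TextIO, stages: List[Stage]) -> List[List[Token]]:
    """Feed each text block from the reader through the stages, in order."""
    results = []
    for block in reader(stream):
        tokens = tokenize(block)
        for stage in stages:
            tokens = stage(tokens)
        results.append(tokens)
    return results


text = "Abe Lincoln likes the iPhone-8\n"
in_order = run_pipeline(io.StringIO(text), [add_lowercase, tag_known_names])
swapped = run_pipeline(io.StringIO(text), [tag_known_names, add_lowercase])

print(in_order[0][0]["tags"])  # the tagger saw the lowercase variant "abe"
print(swapped[0][0]["tags"])   # the tagger ran too early and found no match
```

In the swapped order the tagger runs before the lowercase variant exists, so "Abe" is never matched against the lowercase dictionary.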


What is an "interpretation graph"?

Every token in a piece of text could have multiple interpretations. An "interpretation graph" is an efficient method for showing all possible interpretations of a piece of text.

As an example, the interpretation graph of "Abe Lincoln likes the iPhone-8" might look like this:

[Diagram: interpretation graph for "Abe Lincoln likes the iPhone-8"]

In this example, we see that:

  • An interpretation of the entire sentence is {person-product-preference}.
    • In other words, there's a person who likes a product
  • The {person} is made up of two tokens:  "abe" → "lincoln"
  • The token "lincoln" has a title-case alternative: "Lincoln"
  • The token "likes" has a lemmatized alternative:  "like"

What's not shown in the above diagram are confidence factors, which are tagged on every interpretation.

Interpretation Graphs are made from Vertexes and Lexical Items

  • Lexical Items - Can be a text block, token, or semantic tag
    • Typically important carriers of semantic information
  • Vertexes - Are the junction points between interpretations
    • Typically the white-space or punctuation between lexical items

It is this "node and edge" structure which makes this an interpretation graph.
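To make the node-and-edge structure concrete, here is a minimal sketch in Python. It is not Saga's data model: the `Vertex` and `LexicalItem` classes, the `connect` helper, the `kind` strings, and the confidence values are all assumptions for illustration. Vertexes are the junction points, lexical items are the edges between them, and each path from the first vertex to the last is one interpretation of the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Vertex:
    """A junction point between lexical items (e.g. the space between words)."""
    position: int
    outgoing: List["LexicalItem"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class LexicalItem:
    """An edge in the graph: a text block, token, or semantic tag."""
    text: str
    kind: str             # "token" or "tag" in this sketch
    start: Vertex
    end: Vertex
    confidence: float = 1.0  # made-up values, for illustration only


def connect(text: str, kind: str, start: Vertex, end: Vertex,
            confidence: float = 1.0) -> LexicalItem:
    """Add a lexical item between two vertexes and return it."""
    item = LexicalItem(text, kind, start, end, confidence)
    start.outgoing.append(item)
    return item


def interpretations(start: Vertex, end: Vertex) -> List[List[str]]:
    """List every path of lexical items leading from start to end."""
    if start is end:
        return [[]]
    paths = []
    for item in start.outgoing:
        for rest in interpretations(item.end, end):
            paths.append([item.text] + rest)
    return paths


# Build the first part of the example sentence: "Abe Lincoln likes ..."
v = [Vertex(i) for i in range(4)]
connect("abe", "token", v[0], v[1])
connect("lincoln", "token", v[1], v[2])
connect("Lincoln", "token", v[1], v[2], confidence=0.9)  # title-case alternative
connect("likes", "token", v[2], v[3])
connect("like", "token", v[2], v[3], confidence=0.9)     # lemmatized alternative
connect("{person}", "tag", v[0], v[2], confidence=0.8)   # spans two tokens

for path in interpretations(v[0], v[3]):
    print(" ".join(path))
```

Six lexical items on four vertexes are enough to encode six distinct interpretations, which is where the compactness of the graph comes from.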

Interpretation Graphs are "Add Only"

Information can only be added to an interpretation graph. It can never be removed or changed. By this we mean:

  • More alternative paths can be added
  • More lexical variations can be added
  • Flags (and possibly additional metadata - TBD) can be added to lexical items and vertexes
    • Flags, once set, can never be un-set.
  • Tokens, text blocks, semantic tags, once added, can never be removed
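One way to enforce these rules in code is to expose only "add" operations and read-only views. The class below is a sketch of that idea, not Saga's implementation; the class name `AddOnlyItem`, its methods, and the flag name `TITLE_CASE` are hypothetical.

```python
class AddOnlyItem:
    """A lexical item whose flags and variations can grow but never shrink."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._flags = set()
        self._variations = []

    @property
    def text(self) -> str:
        # No setter is defined, so the original text cannot be changed.
        return self._text

    @property
    def flags(self) -> frozenset:
        # Callers get an immutable copy, so they cannot un-set a flag.
        return frozenset(self._flags)

    @property
    def variations(self) -> tuple:
        # Callers get an immutable copy of the variations added so far.
        return tuple(self._variations)

    def set_flag(self, flag: str) -> None:
        """Set a flag. There is deliberately no matching clear_flag()."""
        self._flags.add(flag)

    def add_variation(self, text: str) -> None:
        """Add a lexical variation, ignoring exact duplicates."""
        if text not in self._variations:
            self._variations.append(text)


item = AddOnlyItem("Lincoln")
item.set_flag("TITLE_CASE")       # hypothetical flag name
item.add_variation("lincoln")
item.add_variation("lincoln")     # adding the same variation twice is harmless

try:
    item.text = "Washington"      # rejected: the property has no setter
except AttributeError:
    print("text is read-only")
```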

The add-only rule comes from hard experience, where we have discovered that, ultimately, "all interpretations are possible". When we have implemented these toolkits previously, we have had to make hard choices. For example: What punctuation splits a token? Is upper case important? Do we need to save the original variation, or is the root word enough? In almost all cases the answer is "sometimes" or, occasionally, "almost always".

And so, we never actually remove any interpretations from the graph. Instead, all interpretations are kept at all times, and disambiguation is used to choose the interpretation that is most likely to be correct for the application.

Resources

Resources are any of the data structures which typically support an engine like this. This includes pipeline configurations, dictionaries, pattern databases (perhaps from text mining), machine learning models, etc.

Resource Providers

Resources are provided by "resource providers", which insulate the pipeline stages from having to know the details of the underlying storage technology. Example providers are "FileSystem" and "Elasticsearch".

Resource providers are configured in the "config.json" configuration file. It contains a "providers" section with parameters for each provider, such as server connection strings, username, password, base directory path, etc.

A Key Design Goal: Changing the storage location of a resource will not require changing the pipeline configuration.

For example, you might first develop your NLP program using simple files, and later move it to a NoSQL database so you have real-time updates. The same pipeline configuration should work in both places.
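The design goal can be illustrated with a small Python sketch. Everything here is an assumption made for the example rather than Saga's real format: the layout of the "providers" section, the `"provider:resource"` reference syntax, the provider name `main`, and the classes `FileSystemProvider` and `InMemoryProvider` (the latter standing in for a database-backed provider such as Elasticsearch). The stage configuration refers to a resource only by provider name, so switching storage means editing the provider entry and nothing else.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class ResourceProvider(ABC):
    """Hides the storage technology from the pipeline stages."""

    @abstractmethod
    def load(self, name: str) -> dict:
        ...


class FileSystemProvider(ResourceProvider):
    """Loads resources from JSON files under a base directory."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = base_dir

    def load(self, name: str) -> dict:
        path = os.path.join(self.base_dir, name + ".json")
        with open(path, encoding="utf-8") as f:
            return json.load(f)


class InMemoryProvider(ResourceProvider):
    """Stand-in for a database-backed provider in this sketch."""

    def __init__(self, documents: dict) -> None:
        self.documents = documents

    def load(self, name: str) -> dict:
        return self.documents[name]


# Maps the hypothetical "type" field of a provider entry to a constructor.
PROVIDER_TYPES = {
    "FileSystem": lambda p: FileSystemProvider(p["baseDir"]),
    "InMemory": lambda p: InMemoryProvider(p["documents"]),
}


def build_providers(config: dict) -> dict:
    """Create one provider object per entry in the "providers" section."""
    return {p["name"]: PROVIDER_TYPES[p["type"]](p) for p in config["providers"]}


def load_resource(providers: dict, reference: str) -> dict:
    """Resolve a "provider:resource" reference from a stage configuration."""
    provider_name, resource_name = reference.split(":", 1)
    return providers[provider_name].load(resource_name)


# The stage configuration is identical for both storage back ends.
stage_config = {"dictionary": "main:names"}
names = {"abe lincoln": "person"}

with tempfile.TemporaryDirectory() as base_dir:
    with open(os.path.join(base_dir, "names.json"), "w", encoding="utf-8") as f:
        json.dump(names, f)

    file_config = {"providers": [
        {"name": "main", "type": "FileSystem", "baseDir": base_dir}]}
    memory_config = {"providers": [
        {"name": "main", "type": "InMemory", "documents": {"names": names}}]}

    from_files = load_resource(build_providers(file_config),
                               stage_config["dictionary"])
    from_memory = load_resource(build_providers(memory_config),
                                stage_config["dictionary"])

print(from_files == from_memory)
```

Only the provider entry differs between the two configurations; `stage_config` is never touched.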