A key innovation of the Saga Library is that the output of language processing is a graph of alternative representations.




What is an "Interpretation Graph"?

Every token in a piece of text could have multiple interpretations. An "Interpretation Graph" is an efficient method for showing all possible interpretations of a piece of text.

As an example, the interpretation graph of "Abe Lincoln likes the Galaxy-8" might look like this:

In this example, we see that:

  • An interpretation of the entire sentence is {person-product-preference}
    • In other words, there's a person who likes a product
  • The {person} is made up of two tokens:  "abe" → "lincoln"
  • The token "lincoln" has a title-case alternative: "Lincoln"
  • The token "likes" has a lemmatized alternative:  "like"

What's not shown in the above diagram are confidence factors, which are tagged on every interpretation.

Interpretation graphs are made from vertices and lexical items

It is this "node and edge" structure which makes this an interpretation graph.

Lexical Items

A lexical item can be a text block, token, or semantic tag. Typically, lexical items are the important carriers of semantic information.

The structure of a lexical item is as follows:

  • Flags - (of type LEX_ITEM) Indicates the properties of the current lexical item. For more information, refer to Flags later in this page.
  • Confidence - Number between 0 and 1 representing the confidence level of this lexical item.
  • Text - The actual text represented by the item.
  • Stage - Stage Name from which this item was created.
  • Back - Pointer to the previous Vertex.
  • Next - Pointer to the following Vertex.
  • Components - If this item was created from the transformation or accumulation of one or more lexical items, those items are referenced here as a list of lexical items.
  • Entities - (Only for Semantic Tags) List of JSON documents, where each document represents a possible entity.


A Lexical Item will always have a back and next vertex, and just one of each.

Vertex

Vertices are the junction points between interpretations, typically the white-space or punctuation between lexical items.

The structure of a vertex is as follows:

  • Flags - (of type VERTEX) Indicates the properties of the current vertex. For more information, refer to Flags later in this page.
  • Text - The actual text represented by the vertex (if any).
  • Stage - Stage Name from which this vertex was created.
  • Back - Pointer to a list of previous lexical items.
  • Next - Pointer to a list of following lexical items.

A vertex does not always have both a back and a next list: the vertex at the beginning of the graph has no previous lexical items, and the vertex at the end has no following ones. In these cases the list can be null or empty.
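The two structures above can be sketched as plain data classes. This is only an illustrative Python model, not the Saga API: the class names, field types, the `add_item` helper, and the confidence value 0.8 are assumptions made for the example, and flags are simplified to sets of strings.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    """Junction point between lexical items (hypothetical layout)."""
    text: str = ""                      # white-space or punctuation covered, if any
    stage: str = ""                     # name of the stage that created the vertex
    flags: set[str] = field(default_factory=set)
    back: list[LexicalItem] = field(default_factory=list, repr=False)  # items ending here
    next: list[LexicalItem] = field(default_factory=list, repr=False)  # items starting here


@dataclass(eq=False)
class LexicalItem:
    """Text block, token, or semantic tag (hypothetical layout)."""
    text: str
    back: Vertex = field(repr=False)    # exactly one previous vertex
    next: Vertex = field(repr=False)    # exactly one following vertex
    confidence: float = 1.0             # between 0 and 1
    stage: str = ""
    flags: set[str] = field(default_factory=set)
    components: list[LexicalItem] = field(default_factory=list, repr=False)
    entities: list[dict] = field(default_factory=list)  # semantic tags only


def add_item(text, back, next_, **extra):
    """Create a lexical item and register it on both of its vertices."""
    item = LexicalItem(text, back, next_, **extra)
    back.next.append(item)
    next_.back.append(item)
    return item


# Part of the "Abe Lincoln likes the Galaxy-8" example.
v0, v1, v2 = Vertex(), Vertex(text=" "), Vertex(text=" ")
abe = add_item("abe", v0, v1, flags={"TOKEN"})
lincoln = add_item("lincoln", v1, v2, flags={"TOKEN"})
title = add_item("Lincoln", v1, v2, flags={"TOKEN", "TITLE_CASE"})
person = add_item("{person}", v0, v2, confidence=0.8,
                  flags={"SEMANTIC_TAG"}, components=[abe, lincoln])
```

Here `v0` is the start of the graph, so its back list stays empty, while its next list holds both the "abe" token and the {person} tag that spans two tokens.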

Interpretation graphs are "Add Only"

Information can only be added to an interpretation graph. It can never be removed or changed. By this we mean:

  • More alternative paths can be added.
  • More lexical variations can be added.
  • Flags and possibly additional metadata can be added to lexical items and vertices.
    • Flags, once set, can never be un-set.
  • Tokens, text blocks, and semantic tags, once added, can never be removed.

This comes from hard experience where we have discovered that, ultimately, "all interpretations are possible". 
When we have implemented these toolkits previously, we have had to make hard choices.
For example: what punctuation splits a token? Is upper case important? Do we need to save the original variation, or is the root word enough?
In almost all cases the answer is "sometimes" or, occasionally, "almost always".

And so, we never actually remove any interpretations from the graph. Instead, all interpretations are kept at all times, and disambiguation is used to choose which interpretation is most likely to be correct for the application.
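The "add only" idea can be illustrated with a container that offers no way to remove or replace what it holds. This is a minimal Python sketch under assumptions: `AddOnlyList` and `lowercase_stage` are hypothetical names, and the alternatives between two vertices are reduced to plain strings.

```python
class AddOnlyList:
    """Collection that supports adding and reading, but not removal or replacement."""

    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)
        return item

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


def lowercase_stage(alternatives):
    """Add a lower-case variant next to each alternative; originals stay in place."""
    for text in list(alternatives):          # copy, since we add while scanning
        lowered = text.lower()
        if lowered not in alternatives:
            alternatives.add(lowered)


alternatives = AddOnlyList()
alternatives.add("President")
lowercase_stage(alternatives)
print(list(alternatives))
```

After the stage runs, both "President" and "president" are available, and a later disambiguation step can pick between them.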

Everything is saved

Along with the "add only" approach, we endeavor to save everything. For example:

  • Lexical items contain character buffers of the text for the item.
  • Vertices contain character buffers of the characters which they cover (e.g. the spaces, punctuation, etc.).

Further, every vertex and lexical item identifies the start and end character position (from the original content stream) which it covers.
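As a rough illustration of why saved text and character positions are useful, the Python sketch below splits a string into alternating items and vertices and records the range each one covers. The `Span` type, the `split_with_offsets` function, the white-space-only splitting rule, and the exclusive end offset are assumptions for the example, not Saga's actual tokenizer or offset convention.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    kind: str    # "item" for a token, "vertex" for the white-space between tokens
    text: str    # saved copy of the covered characters
    start: int   # position in the original content
    end: int     # exclusive end position (an assumption of this sketch)


def split_with_offsets(content):
    """Split on white-space, keeping every character and its position."""
    spans = []
    for match in re.finditer(r"\s+|\S+", content):
        kind = "vertex" if match.group().isspace() else "item"
        spans.append(Span(kind, match.group(), match.start(), match.end()))
    return spans


content = "Abe Lincoln likes the Galaxy-8"
spans = split_with_offsets(content)

# Because nothing is discarded, the original content can be rebuilt exactly.
rebuilt = "".join(span.text for span in spans)
```

Since every span also knows its start and end position, an application can map any interpretation back to the exact characters in the original content stream.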

Flags

Flags are bits that can be turned on (i.e. 'set') for lexical items and vertices. Flags are typically used for unambiguous, processing-related functions. Their function is often to control down-stream processing to make the pipelines more efficient.

Once they are set, they can never be un-set (well, frankly, you can actually change the bits at any time, so this is more of an honor-system).

Flags typically identify obvious and unambiguous characteristics of the lexical item and/or vertex. For example: lexical item type (TEXT_BLOCK, TOKEN, SEMANTIC_TAG), case (ALL_UPPER_CASE, TITLE_CASE, MIXED_CASE), vertex characters (WHITESPACE, PUNCTUATION), etc.
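Bit flags of this kind can be modeled with Python's standard `enum.Flag`. The sketch below reuses the flag names mentioned on this page, but the enum name `LexItemFlag`, the bit values, and the `case_flag` helper are assumptions and do not reflect how Saga stores or computes its flags.

```python
from enum import Flag, auto


class LexItemFlag(Flag):
    # Lexical item type
    TEXT_BLOCK = auto()
    TOKEN = auto()
    SEMANTIC_TAG = auto()
    # Case
    ALL_UPPER_CASE = auto()
    ALL_LOWER_CASE = auto()
    TITLE_CASE = auto()
    MIXED_CASE = auto()


def case_flag(text):
    """Pick the case flag for a token using Python's built-in string tests."""
    if text.isupper():
        return LexItemFlag.ALL_UPPER_CASE
    if text.islower():
        return LexItemFlag.ALL_LOWER_CASE
    if text.istitle():
        return LexItemFlag.TITLE_CASE
    return LexItemFlag.MIXED_CASE


# Setting a flag is a bitwise OR; by convention nothing ever clears a bit.
flags = LexItemFlag.TOKEN
flags |= case_flag("President")

if LexItemFlag.TITLE_CASE in flags:
    print("title-case token")
```

A down-stream stage can then test a single bit, for example to skip every item that is not a TOKEN, without inspecting the text again.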

Flags only describe the Lexical Item itself

It may seem obvious, but flags describe the Lexical Item itself, and do not describe any items from which it was derived.


For example, if you have the following graph:

V----[President]----V

And then you apply the Case Analysis Stage to this graph, you will get:

V----[President]----V
^---[president]----^


In this example, the first "President" token will have the TITLE_CASE flag, and the second (normalized) "president" token will have the ALL_LOWER_CASE flag. There is no flag which says "I was derived from some other token which was TITLE_CASE".

You can traverse the component links from the derived item ("president") to the original item ("President") to determine whether some token was originally TITLE_CASE.
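Such a traversal could look like the following Python sketch. The `Item` class and the `has_flag_in_lineage` function are hypothetical, flags are simplified to strings, and the check deliberately includes the item itself as well as everything it was derived from.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    text: str
    flags: set = field(default_factory=set)
    components: list = field(default_factory=list)   # items this one was derived from


def has_flag_in_lineage(item, flag):
    """Return True if the item, or any item it was derived from, carries the flag."""
    stack = [item]
    seen = set()
    while stack:
        current = stack.pop()
        if id(current) in seen:       # guard against visiting shared components twice
            continue
        seen.add(id(current))
        if flag in current.flags:
            return True
        stack.extend(current.components)
    return False


original = Item("President", {"TOKEN", "TITLE_CASE"})
derived = Item("president", {"TOKEN", "ALL_LOWER_CASE"}, [original])

# The derived token itself is not TITLE_CASE, but its lineage is.
directly_title_case = "TITLE_CASE" in derived.flags
originally_title_case = has_flag_in_lineage(derived, "TITLE_CASE")
```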

Semantic Tags

Semantic tags identify (typically) semantic interpretations of sections of the content. This can include anything from entities (like {person}, {place}, etc.) to full sentence interpretation (as in {person-fact-request}, {restrictive-covenant-term}, {language-fluency-statement}, etc.) or possibly more.

Unlike flags (see above), Saga does not pre-define any semantic tags. Instead, semantic tags are determined based on the requirements of the text to be processed.

Specifically:

  • Taggers will add semantic tags for entities
    • For example, to look up names from a dictionary and to tag those names where they occur in the document
  • Advanced pattern recognizers will identify combinations of tags and literal text and create new tags
    • They are called "advanced" because they allow for patterns which have nested and recursive tagging

Semantic Tags will be ambiguous

A key philosophy of Saga is that ambiguity is embraced rather than dreaded. To this end, the system will generate all possible semantic tags, including many and various ambiguous alternatives.

Confidence values

All lexical items have a confidence value, which describes the confidence of the interpretation. This is key for semantic tags, where the confidence value can initially come from external sources (e.g. the likelihood of an entity occurring randomly) and then build up based on context and on how the entity participates in larger patterns.

In addition, patterns can be generated by statistical techniques and then entered into the system. Systems which generate patterns in this way are encouraged to include a confidence value, which is then combined with the confidence of the supporting parts to generate a confidence value for every interpretation.
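This page does not say how the pattern confidence and the confidences of the supporting parts are combined. Purely as an illustration, the Python sketch below assumes a simple product; the function name `combined_confidence` and the multiplication rule are assumptions, and the real combination may well differ.

```python
from math import prod


def combined_confidence(pattern_confidence, part_confidences):
    """Combine a pattern's confidence with its parts (assumed rule: plain product)."""
    for value in (pattern_confidence, *part_confidences):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"confidence out of range: {value}")
    return pattern_confidence * prod(part_confidences)


# A pattern with confidence 0.9 built from parts with confidences 0.8 and 0.5.
score = combined_confidence(0.9, [0.8, 0.5])   # roughly 0.36
```

With a product, one weak part lowers the confidence of every larger interpretation built on it, which is one plausible way for uncertainty to propagate upward.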

Confidence can be strengthened with context

Finally, it is the intention that confidence can be further strengthened with external confidence models. This allows semantic tags to include or be linked to contextual clues which, when found in nearby text, help provide the needed context.

Using the Output

The output of the processing engine will be an interpretation graph with confidence values. It is expected that the application will:

  • Scan through the output
  • Decide (using business rules and confidence factors) which interpretation to accept
  • Use the "components" links to identify all of the text which went into the interpretation
  • Do something with the output
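The steps above can be sketched in Python as follows. The `Interpretation` class, the `choose` and `leaf_text` functions, the 0.5 threshold, and the example confidence values are all assumptions for illustration; a real application would read the graph produced by the engine and apply its own business rules.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    text: str
    confidence: float
    is_semantic_tag: bool = False
    components: list = field(default_factory=list)


def choose(candidates, min_confidence=0.5):
    """Business rule (assumed): take the most confident tag above a threshold."""
    accepted = [c for c in candidates
                if c.is_semantic_tag and c.confidence >= min_confidence]
    return max(accepted, key=lambda c: c.confidence, default=None)


def leaf_text(item):
    """Follow the components links down to the text that went into an item."""
    if not item.components:
        return [item.text]
    texts = []
    for component in item.components:
        texts.extend(leaf_text(component))
    return texts


# Scan: two competing interpretations over the same tokens.
abe = Interpretation("abe", 1.0)
lincoln = Interpretation("lincoln", 1.0)
person = Interpretation("{person}", 0.8, True, [abe, lincoln])
place = Interpretation("{place}", 0.3, True, [lincoln])

best = choose([abe, lincoln, person, place])      # decide
source_text = leaf_text(best) if best else []     # identify the supporting text
```

The final step, doing something with the output, is left to the application; here `best` and `source_text` would be handed to whatever consumes the accepted interpretation.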

