1. Introduction

Saga is a comprehensive, easy-to-use middleware for maintainable and scalable Natural Language Understanding. It provides automated pipeline construction, state-of-the-art handling of language ambiguity, integrated machine learning, and business-friendly user interfaces for creating and maintaining language models at a reasonable cost. In addition to its out-of-the-box algorithms, Saga allows the use of custom Python models. Our customers use it to implement solutions such as text/entity extraction, semantic search, text classification, knowledge graph relationship extraction, question answering, and analytics on unstructured content.


For a better understanding of what Saga is and the purpose of its UI, please check out this presentation:



2. Exploring the UI

The UI is fairly simple. It has a main tab selector at the top where you choose where you want to work (Tags, Evaluate, Pipelines, Datasets, Rules/Executors, or Background Processes).

Login

A Login page is shown when security is enabled in the config file. It uses basic authentication against the user and password defined in the configuration.
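
As a rough illustration only (the actual property names in config.json may differ in your Saga version), the security-related part of the configuration could look something like this:

config.json (illustrative)
{
  "security": {
    "enabled": true,
    "user": "admin",
    "password": "changeme"
  }
}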


Tags

Users can define the semantic tags to be used, along with the recognizers and settings that each tag will use.

But, what is a Semantic Tag anyway?

Semantic Tags are the organizing structure in Saga; they identify and interpret regions of text. They are basically anything you want to identify in text, and you can name them whatever you want as long as it makes sense to you. For example, let's say I want to identify emails in a document I'm processing with Saga; I can create a tag named "email", or maybe "contact-email" or "eMail". Saga will use this tag name in the results to show where emails exist in the text.




Evaluate

The Evaluate tab is used when the user needs to test several tags at once. This can be a quick test using the Preview functionality or a Test Run against a dataset. Test runs started on this screen are called 'Evaluations' because they include statistics that can be used to compare two Evaluation runs that use different settings.


For each evaluation you can open the Search Interface or open detailed statistics:

In case you want to delete an Evaluation, it can be done in the 'Background Processes' tab.

Pipelines

Users can add new pipelines and delete or update existing ones.  A pipeline defines which stages are included before the recognizer stages are added to it.  For example, a pipeline could have stages such as a white space tokenizer, a text case analyzer, or a stop words identifier.

The user has the ability to move stages around using up and down buttons:

Stages can also be inserted at the desired position using the context menu:


Datasets

Users can view the datasets loaded into the application to perform test runs and/or to train machine learning models.  Here they also define which fields to process from the dataset file and how to split the text in order to feed the pipeline.  (Since users cannot upload datasets at the moment, datasets need to be placed in a special folder on the Saga server file system.)
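
Conceptually, a dataset definition ties a file on the Saga server to the fields to process and the way the text is split before it enters the pipeline. A simplified, illustrative sketch (the field names here are hypothetical, not Saga's exact schema):

aviation-dataset (illustrative)
{
  "name": "Aviation",
  "file": "datasets/aviation-incidents.csv",
  "fields": ["report_text"],
  "split": "paragraph"
}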


Rules/Executors

As of 1.3.3 this functionality is deprecated until integration with SearchAPI is complete.

Saga can be the engine behind semantic search, as it is for the ESUI (Enterprise Search User Interface). The user can create rules and define what rules will be executed for each tag that Saga identifies in a query, through the configuration done in this tab.

Saga provides integration with a custom ESUI.




Background Processes

Users can monitor background processes that are running.  For example, when running a test run against a dataset, the process could take a long time to complete, so the user can check the progress in this screen. 


Administration Tools

Export & Import

Users can export and import all of the data in Saga to an .sg folder, either for backup or to import into another Saga instance.

Environment

The environment tool creates and downloads a file with the current environment conditions, such as RAM, CPU, and hard drive space.

Here is an example of the data the file will contain:

sagaEnv.json
{
  "add-ons": {
    "processors": [
      "GoogleKnowledgeStage:1.3.1",
      "NamePredictorStage:1.3.1",
      "FaqStage:1.3.1",
      "ClassificationStage:1.3.1",
      "GoogleEntityPredictorStage:1.3.1"
    ],
    "recognizers": [
      "GoogleKnowledgeStage:1.3.1",
      "NamePredictorStage:1.3.1",
      "FaqStage:1.3.1",
      "ClassificationStage:1.3.1",
      "GoogleEntityPredictorStage:1.3.1"
    ]
  },
  "java": "11.0.2",
  "os": "Windows 10",
  "elasticsearch": "7.4.2",
  "cpu": {
    "usage (%)": {
      "jvm": 7.678,
      "system": 0.0
    },
    "processors": 12,
    "arch": "amd64"
  },
  "memory (Mb)": {
    "jvm": {
      "committed": 8377,
      "using": 4096,
      "max": 10240
    },
    "system": {
      "total": 32503,
      "using": 15610,
      "free": 16893
    },
    "swap": {
      "total": 42743,
      "using": 27392,
      "free": 15351
    }
  },
  "config": "C:\\Saga\\il Master\\saga-server\\config\\config.json",
  "version": "1.3.1",
  "version-date": "2020-03-05T10:14:59.561-06:00"
}


GPT3 Proxy

The GPT3 Proxy tool allows the use of the OpenAI libraries to generate text and search within large documents. This page has more information about this tool.


DXF Playground

The DXF Playground helps the user create UI elements that can be rendered in different parts of Saga. This page has more information about this tool.

Search Interface

Users can review results from a Test Run.  The flow will be something like: 

  1. Add a semantic tag. 
  2. Add recognizers to the tag and configure them.
  3. Test the effectiveness of the tag and its recognizers by running a test run against a dataset file.
  4. To review results, open the Search interface to see how well (or poorly) text was tagged.

If the user spots an entity in the text that is not being recognized by Saga, it can be added to the tag's entity recognizer dictionary by selecting the text and clicking the add button, like this:


The 'Add to dictionary' functionality is only available for test runs that use a single tag, so it will not appear for Evaluation runs.

3. Using the UI

3.1. End to end use case

In this section, we'll go through the process of creating a set of tags, adding some recognizers to them and testing how they perform against a dataset.  This will give the user a better idea of the process/flow when using the Saga UI.

We currently have a dataset loaded into Saga about aviation incidents.  We will try to identify incidents where an engine catches fire due to a bird collision.

Step #1: Check and choose your base pipeline

An important consideration is which stages you want to include in the base pipeline used by recognizers.  Pipelines usually have a set of stages to pre-process text before passing it to the recognizers.

As you can see in the following image, we'll use the baseline-pipeline that has the following stages:

  1. Sentence Breaker: Splits paragraphs into sentences by identifying common characters used as sentence breakers.
  2. WhiteSpaceTokenizer:  Splits sentences into words using white space as the separator.
  3. StopWords:  Identifies very common words that do not add any value to the process (for example, 'the', 'a', 'this', 'those', etc.).
  4. CaseAnalysis:  Identifies whether a word is all UPPERCASE, all lowercase, or mixed case, and then converts the text to lowercase. This is usually used to normalize words so that they match easily when creating patterns in recognizers.
  5. CharChangeSplitter:  Separates tokens where the character type changes (lowercase to uppercase, letter to number, alphanumeric to punctuation) without dropping any characters, and preserves the original capitalization.
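
Conceptually, the base pipeline is just an ordered list of these stages. A simplified, illustrative sketch of such a definition (the actual stage identifiers and configuration schema in Saga may differ):

baseline-pipeline (illustrative)
{
  "name": "baseline-pipeline",
  "stages": [
    "SentenceBreakerStage",
    "WhiteSpaceTokenizerStage",
    "StopWordsStage",
    "CaseAnalysisStage",
    "CharChangeSplitterStage"
  ]
}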


Step #2: Create basic tags you will need

We want to identify three things:

  • birds
  • fire
  • engine

If those three things are present in an incident report, then we could hypothesize that the incident is about engines catching fire due to bird collisions.

  1. So let's start by creating the {bird} tag:


Each tag has the following functionalities:

     Add Child: Allows the user to add a child tag.

     Rename: Allows the user to change the name of the tag in case it was misspelled.

     Delete: Allows the user to delete a tag. Just be careful when deleting parent tags, because the delete functionality will perform a cascading delete of all child tags.

     Find Tag Usage: Finds whether the selected tag is used by other tags (for example, when the tag is used in an Advanced Recognizer).

     Export: Allows you to export the tag to a .sg file.

     Cut: Allows you to cut the tag and paste it under another tag.






2. Add 'SimpleRegex' and 'Entity' recognizers to the bird tag:

 

3. Add the following patterns to the entity recognizer: duck, ducks


4. Repeat the steps for each of the following patterns (you can also add their plural form as additional patterns):

    • eagle
    • hawk
    • seagull
    • duck



Check the settings for the Entity Recognizer; you may need to remove the ALL_LOWER_CASE flag from the 'Required flags' setting in order for the matching to work:




5. Add the Simple Regex recognizer the same way you added the Entity Recognizer, and then add the following regex:

    • bird[s]?


The steps for adding SimpleRegex patterns are very similar to the steps used to add entities to the Entity recognizer.

We are using a regex here just for demonstration purposes. In practice you should simply add the bird entity to the Entity recognizer dictionary; dictionary matching is faster than regex.



6. Now, do the same for fire and engine tags:


Step #3: Add the {fire-by-bird} tag that will use the other tags

The idea here is to create a tag that uses {fire}, {engine}, and {bird} to identify a concept: that the engine caught fire due to a bird collision.  For this special tag, we'll use the Fragmented recognizer. This is an advanced recognizer that tags text containing the other three tags, in any order of appearance, as long as they are close enough to one another within the aviation report.

  1. Create a tag called fire-by-bird.  Use similar steps to create the other tags.
  2. Attach the Fragmented recognizer to the tag.
  3. Add the following pattern: {fire} {engine} {bird}.  Make sure to select the 'In Any Order' check box, set 'Max tokens' to 10, and set 'Min tokens' to 4.


4. For our example, make sure to remove the 'Sentence Breaker' stage from the 'Base Pipeline' in case it is present.



The 'Sentence Breaker' stage processes a text block and splits it into sentences using the configured language. It uses punctuation and language-specific rules to achieve this, but you can also specify a list of additional 'breaker' characters.

For our example, we don't need the aviation incidents to be split into sentences because each incident is a small portion of text.  Even when an incident has several sentences, they are all related to the same incident.
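
For reference, a Sentence Breaker configuration with extra breaker characters might look roughly like the following sketch (the stage identifier and field names are illustrative, not Saga's exact schema):

sentence-breaker stage (illustrative)
{
  "stage": "SentenceBreakerStage",
  "language": "en",
  "additional-breakers": [";", "|"]
}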


5. By default, the 'Whitespace Tokenizer' in Saga expects text blocks carrying the 'SENTENCE' flag, which is normally set by the 'Sentence Breaker' stage. Because we removed that stage, we need to remove the 'SENTENCE' flag from the required list of flags in the 'Whitespace Tokenizer' stage configuration:


6. Make sure all of the recognizers of all your tags are using the same pipeline (or the pipeline you need them to use).

Click on the gear icon in each one of the recognizers to open its settings and check the field 'Base Pipeline'.


7. You can also import patterns from different data sources (for example JSON, JSONL, XML, XLSX, and CSV).


Step #4: Quick test using the preview

Test any of your tags using the preview functionality.  For example, let's test the {fire-by-bird} tag.

  1. Make sure to click on it in the Tag tree.
  2. Then enter the following text into the preview text box: "SEAGULL STRIKE INTO TURBINE ON TAKEOFF SEVERE VIBRATION AND FLAME."

A dialog with the Saga graph is displayed.  Note how the {bird}, {fire}, {engine} and also the {fire-by-bird} tags are identified in the text.

Step #5: Perform a test run with a dataset

Once you have tested the performance of your tags using the preview, it is a good idea to test them against a larger body of text.

You can create your own datasets and place them in a special folder in the Saga file system (learn how to do this in the Datasets article).

  1. Inside the {fire-by-bird} tag, select Start Test Run and then select the "--- New Test Run ---" option.
  2. Select the dataset check box and Execute.


3. Select the Background Processes tab to review the progress of the run.

4. Wait for the test run to complete, or if you cannot wait, click the "Open search" button to see partial results while it is still running.

5. When the process is complete, click Open search to open the search interface. On this screen, you will find your tags as facets.  When a facet is selected, you'll see the search results containing that tag.

In the following image, we are clearing facets and then selecting only {fire-by-bird} to check the comments that talk about engines catching fire due to bird collisions. 


6. After reviewing the results, you can keep iterating: review results, then tweak your tags and pipelines until you have the best model for your specific use case.

3.2 Machine learning recognizers

3.2.1 Name entity recognizer

The name entity recognizer uses Apache OpenNLP to tag text using an existing model (previously trained).  In addition, the recognizer can be used with other recognizers to train a new model.

(If you need the model, go to the OpenNLP Models page and look for en-ner-person.bin.)

3.2.1.1 Using it as recognizer

To use it as a recognizer, add the recognizer to your tag, choose a model, select the probability threshold used to decide whether something is a match, and add normalization tags if you want to cleanse and normalize the input.
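
For orientation, the relevant settings boil down to a model file, a probability threshold, and an optional list of normalization tags. An illustrative sketch (field names are approximate, not Saga's exact schema):

name-entity recognizer settings (illustrative)
{
  "model": "en-ner-person.bin",
  "minimum-probability": 0.5,
  "normalization-tags": []
}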

Let's use the {human} tag to test this functionality:

  1. Select the {human} tag.
  2.  Attach the Name Entity recognizer to the {human} tag.
  3. Choose a model (instructions on how to get a model are here).  This model has been trained to identify English names of people. Note that models might be case sensitive, so in the recognizer configuration change the base pipeline to a stage before the Case Analysis one.





6. Enter the following text in the preview in order to check out the Saga graph: "Several employees work from home, Joseph and Paul are two of them".  As you can see in the following image, the recognizer tags 'Joseph' and 'Paul' as {human}:

3.2.1.2 Using it as a trainer

In order to train a model, use another recognizer as the base and a dataset that has a good sample of the values you want to identify. In this case, we'll use the 'CFR-2018' dataset, which contains federal government regulations.

(You can get this dataset by going to Documents/General/Saga Datasets inside the Microsoft Teams space for SAGA&ESUI.)

  1. Create a new tag called {emissions-equipment}. Attach the Entity recognizer and add the following patterns:
    • vehicle
    • locomotive
    • truck
    • marine engine
    • tanker truck
    • engine

2. Attach the Name recognizer to your tag and click Train.

3. Select the 'CFR-2018-title40' dataset and then Execute.


4. Check out the Background Processes tab to review the progress of the training.

5. Once the process completes:

    • Go to your tag and disable the Entity recognizer.
    • In the Name recognizer, select your recently created model (the result of the training).  It should be named something like "emissions-equipment-[date stamp here]", for example: 'emissions-equipment-20190206172621'.
    • You can also use the option --LATEST-- so it will always use the most recent model you have created.
    • Set the 'Minimum Probability' field to 0.5.
    • Run a TestRun against the CFR dataset.
    • Once the test run finishes, use the Search interface to check if any text was tagged with the emissions-equipment tag.   If you do see text tagged, then the model you trained is working.

3.2.2 Classifier recognizer

The Classifier recognizer is used to perform binary classification of sentences. It uses Apache OpenNLP internally and can be used as a recognizer and as a trainer (just like the Name Entity recognizer).

The difference between the Classifier recognizer and the Name Entity recognizer is that the Name Entity recognizer is used to identify entities; it will tag a word.  The Classifier will tag an entire sentence and may use other algorithms not available in the Name Entity recognizer.

3.2.2.1 Using it as recognizer

  1. In order to use it as a recognizer, attach the recognizer to your tag and then select a model from the list.


2. Then you can test its performance by running a test run.

3.2.2.2 Using it as a trainer

Training is the same as for the Name Entity recognizer.  We need another recognizer to use as base, and a dataset with a good quantity of samples of the text we want to classify.

The following steps describe how to do a training.  We'll use the 'Aviation' dataset and will try to tag sentences that talk about incidents with birds. 

  1. Let's reuse the {bird} tag that we created previously, and create a new one called {hit} with the following patterns:



2. Create a new tag called {bird-incident}.

3. Attach the Fragmented recognizer and add the following pattern:

4. Add the Classification recognizer to the {bird-incident} tag, and select '--NONE--' in the 'Model' field.


Always remember to set this field to --NONE-- when training.

5. Select Train.  When the dialog opens:

    • Select the 'Aviation' dataset.
    • Select 'N-Gram' in the 'Feature Selection' field.
    • Increase the field 'max n-gram' to 3.
    • Select Execute.

This will train a model on the Aviation dataset using the pattern in the Fragmented recognizer.
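
Put together, the training options for this example amount to something like the following sketch (field names are illustrative, not Saga's exact schema):

classification training options (illustrative)
{
  "dataset": "Aviation",
  "feature-selection": "N-Gram",
  "max-n-gram": 3
}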



6. Check out the 'Background Processes' tab and wait for the Classification training to complete.

7. Once done, go back to the {bird-incident} tag and disable the Fragmented recognizer.

8. In the Classification recognizer, select your latest created model in the 'Model' field.  It should be named like: bird-incident-[datetime stamp here].bin.  For example: 'bird-incident-20190208173305.bin'. 

You can also use the option '--LATEST--' to always use the latest trained model.

9. Start a test run using the Aviation dataset.

10. Check out the Background Processes tab for completion. 

11. When complete, select Open Search to check results in the Search screen.


As you can see, the Classification recognizer is tagging some sentences that, in theory, should be related to incidents with birds.  In this case, because the dataset is small and the Fragmented recognizer did not identify many positive samples, the Classification recognizer does not do a very good job of identifying the sentences.  With more and better training data, the accuracy of the Classification recognizer is expected to improve.

You can also play with various training settings to determine which ones generate better results for your specific use case.

3.3 Other Recognizers

This section describes Saga recognizers that were not used in any of the examples explained in previous sections.

3.3.1 Best bets recognizer


This recognizer is part of the functionality that Saga has in place to implement semantic search.

Best Bets, featured in most search interfaces, highlight important information when certain keywords are detected in the user's query.  Best Bets hits do not come from the search engine itself but from a curated list of hits.

So let's say that when someone searches for "how to handle fire in a plane", Best Bets will detect the keywords 'fire' and 'plane' and will therefore show, above the search results, a link to the company's 'How to handle fire' manual. This manual is the official and recommended source of information for those specific cases.


The idea behind the Best Bets recognizer is therefore to keep the curated list of best bets in Saga and then provide the information to ESUI when a query has been tagged with a best bet tag.

How to use it?

  1. Add a new tag called {fire-manual}
  2. Add the 'Bestbets' recognizer to the tag

3. Add a new best bet pattern. This record will contain the following information:

  • Patterns: words to match in the query sent to Saga.  In our case we enter the word 'fire'.
  • Title: this text will appear as the hit title on the search results page.
  • Description: this text will appear as the hit description on the search results page.
  • URL: the URL to navigate to when the user clicks the title on the search results page.
  • Use partial matching: if true, when a pattern is composed of several words, the matching will only require a percentage of the words present in the pattern.  This percentage can be configured in the recognizer settings; by default it is set to 50%.
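
Putting those fields together, a best bet record could look roughly like the following (the values and exact field names are illustrative, and the URL is a placeholder):

fire-manual best bet (illustrative)
{
  "patterns": ["fire"],
  "title": "How to handle fire",
  "description": "Official company manual for handling fire on board",
  "url": "https://intranet.example.com/manuals/how-to-handle-fire",
  "use-partial-matching": false
}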



4. Now that the tag and recognizer are correctly set up, we need to create a rule and an executor that ESUI will run to properly show the best bet on the search results page.

4.1 Go to Rules/Executors tab

4.2 Click on the Executors sub tab

4.3 Add the following code in the "Process" code section.

Please notice:

  • The first parameter in the _saga.response method, 'result', is the data the bestBets widget needs to work properly
  • The second parameter needs to be set to 'bestBets', this is how we tell ESUI what widget to use to display best bets information. 


// Build the payload the bestBets widget expects. 'data' here is expected to
// hold the matched best bet record (title, description, url).
const result = {
   title: data.title,
   description: data.description,
   url: data.url,
};

// The second argument, 'bestBets', tells ESUI which widget renders this response.
return _saga.response(result, 'bestBets');


Your screen should look like this:

4.4 Click on the 'Rules' sub tab and add a new rule that uses the executor we created in the previous step:

4.5 Add the {fire-manual} tag in the ESUI saga endpoint configuration ('tags' property)

4.6 When querying something with the word 'fire', which matches our best bet recognizer, you should see a best bet hit in the search results in ESUI: