1. Introduction

For a better understanding of what Saga is and what the purpose of the UI is, please check out the "Saga Introduction - User Manual.pptx" presentation in Teams.

2. Exploring the UI

The UI is pretty simple. It has a main tab selector at the top where you can determine where you want to work (Tags, Pipelines, Datasets or Background Processes).

Tags

Users can define semantic tags to be used, and the recognizers and settings that each tag will use.

Image Modified


Pipelines

Users can add, delete or update pipelines. A pipeline defines which stages are included before recognizer stages are added to it. For example, you could have stages such as a white space tokenizer, a text case analyzer or a stop words identifier.

Image Modified

Datasets

Users can view the datasets loaded into the application to perform test runs and/or training of machine learning models. Users can also define which fields to process from the dataset file and how to split the text in order to feed the pipeline.

(Since users cannot upload datasets at the moment, datasets need to be placed in a special folder in the Saga Server file system.)

Image Modified




Background Processes

Users can monitor background processes that are running. For example, when running a test run against a dataset, the process could take a long time to complete, so the user can check the progress in this screen.

Image Modified

Search Interface

Users can review results from a Test Run.

The flow will be something like:

  1. Add a semantic tag.
  2. Add recognizers to the tag and configure them.
  3. Test the effectiveness of the tag and its recognizers by running a test run against a dataset file.
  4. To review results, open the search interface to see how well (or poorly) text was tagged.

Image Modified

3. Using the UI

3.1. End to end use case

In this section, we'll go through the process of creating a set of tags, adding some recognizers to them and testing how they perform against a dataset. This will give the user a better idea of the process/flow when using the Saga UI.

We currently have a dataset loaded into Saga about aviation incidents. We will try to identify incidents where an engine catches fire due to a bird collision.

Step #1: Check and choose your base pipeline

An important consideration is which stages you want to include in the base pipeline used by recognizers. The base pipeline usually has some stages to pre-process text before passing it to the recognizers.

As you can see in the following image, we'll use the baseline-pipeline that has the following stages:

  1. WhiteSpaceTokenizer: Splits sentences into words using white space as a separator.
  2. StopWords: Identifies very common words that do not add any value to the process. (For example, 'the', 'a', 'this', 'those', etc.)
  3. CaseAnalysis: Identifies whether or not a word is all UPPERCASE or lowercase and then converts text to lowercase. This is usually used to normalize words so that they match easily when creating patterns in recognizers.
  4. CharChangeSplitter: Separates tokens based on character changes (from lowercase-uppercase, letter-number, alphanumeric-punctuation) without taking any character in the vertex, and respecting the capital letter.


Image Modified
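The four stages listed above can be pictured with the following minimal Python sketch. It is only an illustration of the kind of pre-processing described, not Saga's implementation: the function names, the tiny stop word list and the order in which the steps are applied are all assumptions made for the example.

```python
import re

# Assumption: a tiny stop word list taken from the examples in the StopWords description.
STOP_WORDS = {"the", "a", "this", "those"}


def whitespace_tokenize(text):
    """WhiteSpaceTokenizer idea: split on runs of white space."""
    return text.split()


def char_change_split(token):
    """CharChangeSplitter idea: split where the character class changes.

    A capital letter stays attached to the lowercase letters that follow it,
    so 'fooBar' becomes 'foo' + 'Bar'.
    """
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+|[^A-Za-z0-9\s]+", token)


def run_pipeline(text):
    """Return one record per token: normalized text plus case and stop word flags."""
    records = []
    for word in whitespace_tokenize(text):
        for piece in char_change_split(word):
            records.append({
                "text": piece.lower(),                      # CaseAnalysis: normalize to lowercase
                "all_upper": piece.isupper(),               # CaseAnalysis: remember original case
                "stop_word": piece.lower() in STOP_WORDS,   # StopWords: flag very common words
            })
    return records


if __name__ == "__main__":
    for record in run_pipeline("The SEAGULL hit engine2"):
        print(record)
```

Because every token is lowercased before it reaches a recognizer, patterns such as `seagull` can match text written as `SEAGULL` without extra work.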

Step #2: Create basic tags you will need

We want to identify three things:

  • birds
  • fire
  • engine

If those three things are present in an incident report, then we could hypothesize that the incident is about engines catching fire due to bird collisions.

  1. So let's start by creating the {bird} tag:

Image Modified

  2. Add the 'SimpleRegex' and 'Entity' recognizers to the {bird} tag:

Image Modified

  3. Add the following patterns to the Entity recognizer. (See the image below for an example of how to do it.) Repeat the steps for each of the following patterns:

  • duck
  • hawk
  • seagull

Image Modified

  4. Add the following regex to the SimpleRegex recognizer. (Note: The steps are very similar to how entities were added in the Entity recognizer.)

  • bird[s]?

Image Modified

  5. Now, do the same for the {fire} and {engine} tags:

Image Modified


Image Modified
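To make the effect of the two {bird} recognizers more concrete, here is a small Python sketch that mimics them outside of Saga. The entity patterns and the regex `bird[s]?` come from the steps above; the function name and the assumption that patterns are compared against whole, lowercased tokens are ours and may not match how Saga applies them internally.

```python
import re

# Patterns added to the Entity recognizer in the steps above.
ENTITY_PATTERNS = {"duck", "hawk", "seagull"}

# Regex added to the SimpleRegex recognizer in the steps above.
BIRD_REGEX = re.compile(r"bird[s]?")


def tag_birds(text):
    """Return the lowercased tokens this sketch would tag as {bird}.

    Assumption: a pattern must match a whole token, so 'birdcage' is not tagged.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t in ENTITY_PATTERNS or BIRD_REGEX.fullmatch(t)]


if __name__ == "__main__":
    print(tag_birds("BIRDS INGESTED. A hawk and one bird hit the birdcage."))
```

The regex `bird[s]?` is equivalent to `birds?`: it accepts "bird" with an optional trailing "s".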

Step #3: Add the {fire-by-bird} tag that will use the other tags

The idea here is to create a tag that uses {fire}, {engine} and {bird} to identify a single concept: an engine catching fire due to birds. For this special tag, we'll use the Fragmented recognizer. This is an advanced recognizer that tags text that contains the other three tags, in any order of appearance, when they are close enough to one another within the aviation report.

  1. Create a tag called 'fire-by-bird'. Use steps similar to those used to create the other tags.
  2. Attach the Fragmented recognizer to the tag.
  3. Add the following pattern: {fire} {engine} {bird}. Make sure to check the 'In Any Order' option and set Max tokens to 16 and Min tokens to 4.

Image Modified

  4. Make sure all of the recognizers of all your tags are using the same pipeline (or the pipeline you need them to use).
  5. Click on the gear icon in each one of the recognizers to open its settings and check the 'Base Pipeline' field:

Image Modified
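The behavior configured for the Fragmented recognizer in step 3 ({fire} {engine} {bird}, 'In Any Order', Max tokens 16, Min tokens 4) can be approximated with the sketch below. This is a hedged illustration only: the function name is hypothetical, and we assume that Min/Max tokens bound the length of the token span that contains all three tags, which may differ from Saga's exact semantics.

```python
def find_fragment(tagged_tokens, required, min_tokens=4, max_tokens=16):
    """Find the shortest span that contains every required tag, in any order.

    tagged_tokens: list of (token, tag) pairs, where tag is None for untagged tokens.
    required: set of tag names that must all appear inside the span.
    Returns (start, end) token indexes, inclusive, or None if nothing qualifies.
    """
    best = None
    for start, (_, first_tag) in enumerate(tagged_tokens):
        if first_tag not in required:
            continue  # a span must start on one of the required tags
        seen = set()
        last = min(start + max_tokens, len(tagged_tokens))
        for end in range(start, last):
            tag = tagged_tokens[end][1]
            if tag in required:
                seen.add(tag)
            if seen == required:
                length = end - start + 1
                shorter = best is None or length < best[1] - best[0] + 1
                if length >= min_tokens and shorter:
                    best = (start, end)
                break
    return best


if __name__ == "__main__":
    # Assumption: which words carry which tag is made up for this example.
    assumed_tags = {"seagull": "bird", "turbine": "engine", "flame": "fire"}
    words = "seagull strike into turbine on takeoff severe vibration smoke and flame".split()
    tagged = [(w, assumed_tags.get(w)) for w in words]
    print(find_fragment(tagged, {"fire", "engine", "bird"}))
```

Lowering Max tokens makes the recognizer stricter: in this example the three tags span 11 tokens, so a limit below 11 would no longer produce a match.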

Step #4: Quick test using the preview

Test any of your tags using the preview functionality. For example, let's test the {fire-by-bird} tag.

  1. Make sure to click on it in the Tag tree.
  2. Then enter the following text into the preview text box: "SEAGULL STRIKE INTO TURBINE ON TAKEOFF. SEVERE VIBRATION, SMOKE AND FLAME."

Image Modified


A dialog with the Saga graph is displayed.

Note how the {bird}, {fire}, {engine} and also the {fire-by-bird} tags are identified in the text:

Image Modified

Step #5: Perform a test run with a dataset

Once you have tested the performance of your tags using the preview, it is a good idea to test them against a larger body of text.

At the moment, Saga comes with several testing datasets. However, you can also create your own and upload them to a special folder in the Saga file system.

  1. Inside the {fire-by-bird} tag, select Test Run and then select the "--- New Test Run ---" option.
  2. Select the Aviation-Incidents dataset and then select Execute.

Image Modified

  3. Select the Background Processes tab to check the progress of the run.

Image Modified

  4. Wait for the test run to complete, or for partial results to appear while it is running.
  5. Select Open search to open the search interface. In this screen, you will find your tags as facets. When a facet is selected, you'll see search results containing that tag.

In the following image, we are clearing facets and then selecting only {fire-by-bird} to check the comments that talk about engines catching fire due to bird collisions.

Image Modified

  6. After reviewing the results, you can keep iterating: review results, then tweak your tags and pipelines, until you have the best model for your specific use case.

3.2 Machine learning recognizers

3.2.1 Name entity recognizer

The Name Entity recognizer uses Apache OpenNLP to tag text using an existing, previously trained model.

In addition, the recognizer can be used with other recognizers to train a new model.

3.2.1.1 Using it as a recognizer

In order to use it as a recognizer: add the recognizer to your tag, choose a model, select the probability threshold used to decide if something is a match or not, and add normalization tags in case you want to cleanse and normalize the input.

Let's use the {human} tag to test this functionality:

  1. Select the {human} tag.
  2. Select the Entity recognizer and then click the gear button to bring up the settings.
  3. Disable the Entity recognizer.

Image Modified

  4. Attach the Name Entity recognizer to the {human} tag.
  5. Choose the default model "en-ner-person.bin". This model has been trained to identify English names of people.

Image Modified

  6. Enter the following text in the preview in order to check out the Saga graph: "Several employees work from home, Joseph is one of them, Paul too".

As you can see in the following image, the recognizer tags 'Joseph' and 'Paul' as {human}:

Image Modified
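The probability threshold mentioned above decides which of the model's candidate matches are kept. The sketch below shows the idea in Python; the function name, the "greater than or equal" comparison and the scores in the example are all assumptions for illustration, not values produced by Saga or OpenNLP.

```python
def keep_confident(candidates, min_probability):
    """Keep only candidates whose probability reaches the threshold.

    candidates: iterable of (text, probability) pairs.
    """
    return [text for text, probability in candidates if probability >= min_probability]


if __name__ == "__main__":
    # Made-up scores, used only to show the effect of the threshold.
    candidates = [("Joseph", 0.93), ("Paul", 0.88), ("home", 0.12)]
    print(keep_confident(candidates, 0.5))
```

A higher threshold yields fewer but more reliable tags; a lower threshold tags more text at the cost of more false positives.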

3.2.1.2 Using it as a trainer

In order to train a model, use another recognizer as the base and a dataset that has a good sample of the values you want to identify.

In this case, we'll use the CFR-2018 dataset, which contains regulations from the government.

  1. Create a new tag called {emissions-equipment}. Attach the Entity recognizer and add the following patterns:
    • vehicle
    • locomotive
    • truck
    • marine engine
    • tanker truck
    • engine

Image Modified

  2. Attach the Name Entity recognizer to your tag and click Train.
  3. Select the 'CFR-2018-title40' dataset and then select Execute.

Image Modified

  4. Check out the Background Processes tab to see the progress of the training.
  5. Once the process completes:
    • Go to your tag and disable the Entity recognizer.
    • In the Name Entity recognizer, select your recently created model (the result of the training). It should be named something like "emissions-equipment-[date stamp here]", for example: emissions-equipment-20190206172621.
    • You can also use the option --LATEST-- so it will always use the most recent model you have created.
    • Set the 'Minimum Probability' field to 0.5.
    • Run a test run against the CFR dataset.
    • Once the test run finishes, use the search interface to check if any text was tagged with the {emissions-equipment} tag. If you do see tagged text, then the model you trained is working.

Image Modified
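The model names produced by training end in a date stamp, and the --LATEST-- option picks the most recent one. The following Python sketch shows one way such a selection could work. It assumes the stamp is a 14-digit year-month-day-hour-minute-second value, which is inferred from the example name above; the function names and the second model name in the example are hypothetical.

```python
import re
from datetime import datetime

# Assumption: the stamp is 14 digits (yyyyMMddHHmmss), optionally followed by ".bin".
STAMP = re.compile(r"-(\d{14})(?:\.bin)?$")


def model_timestamp(name):
    """Parse the date stamp at the end of a trained model name."""
    match = STAMP.search(name)
    if match is None:
        raise ValueError(f"no date stamp in model name: {name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def latest_model(names):
    """Return the model name with the most recent date stamp."""
    return max(names, key=model_timestamp)


if __name__ == "__main__":
    models = [
        "emissions-equipment-20190105090000",  # hypothetical older model
        "emissions-equipment-20190206172621",  # example name from the steps above
    ]
    print(latest_model(models))
```

Selecting an explicit model name keeps test runs reproducible, while --LATEST-- is convenient when you retrain often.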

3.2.2 Classifier recognizer

This recognizer is used to perform binary classification of sentences. It also uses Apache OpenNLP internally and can be used as a recognizer and as a trainer (just like the Name Entity recognizer).

The difference between the Name Entity recognizer and this one is that the Name Entity recognizer is used to identify entities; it will tag a word. The classifier will tag an entire sentence and may use other algorithms not available in the Name Entity recognizer.

3.2.2.1 Using it as a recognizer

In order to use it as a recognizer, attach the recognizer to your tag and then select a model from the list.

Image Modified

Then you can test its performance by running a test run.

3.2.2.2 Using it as a trainer

Training is the same as for the Name Entity recognizer. We need another recognizer to use as a base, and a dataset with a good quantity of samples of the text we want to classify.

The following steps describe how to train a model. We'll use the Aviation dataset and try to tag sentences that talk about incidents with birds.

  1. Let's reuse the {bird} tag that we created previously, and create a new one called {hit} with the following patterns:

Image Modified

  2. Create a new tag called {bird-incident}.
  3. Attach the Fragmented recognizer and add the following pattern:

Image Modified

  4. Add the Classification recognizer to the {bird-incident} tag and select '--NONE--' in the 'Model' field. NOTE: Always remember to set this field to --NONE-- when training.
  5. Select Train. When the dialog opens:
    • Select the Aviation dataset.
    • Select 'N-Gram' in the 'Feature Selection' field.
    • Increase the 'max n-gram' field to 3.
    • Select Execute.

This trains a model on the Aviation dataset, using the pattern in the Fragmented recognizer to identify the training samples.

Image Modified

  6. Check out the 'Background Processes' tab and wait for the Classification training to complete.
  7. Once done, go back to the {bird-incident} tag and disable the Fragmented recognizer.
  8. In the Classification recognizer, select your latest created model in the 'Model' field. It should be named like: bird-incident-[datetime stamp here].bin. For example: 'bird-incident-20190208173305.bin'. You can also use the option '--LATEST--' to always use the latest trained model.
  9. Start a test run using the Aviation dataset.
  10. Check out the 'Background Processes' tab for completion.
  11. When complete, select Open Search to check results in the Search screen.

Image Modified
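The 'N-Gram' feature selection used in step 5, with 'max n-gram' set to 3, means the classifier looks at single words as well as sequences of two and three consecutive words. The Python sketch below illustrates that idea only; the actual feature generation is done by Apache OpenNLP inside Saga and may differ in detail, and the function name is hypothetical.

```python
def ngram_features(tokens, max_n=3):
    """Return all n-grams of the token list for n = 1 .. max_n, shortest first."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


if __name__ == "__main__":
    print(ngram_features(["bird", "hit", "engine"]))
```

Raising 'max n-gram' lets the model pick up short phrases such as "bird hit engine" instead of isolated words, at the cost of many more features, which generally calls for more training data.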


As you can see, the Classification recognizer is tagging some sentences that in theory are supposed to be related to incidents with birds. In this case, because the dataset is small and the Fragmented recognizer identified relatively few positive samples, the Classification recognizer is not doing a very good job of identifying the sentences.

It is expected that with more and better training data, the accuracy of the Classification recognizer will improve.

NOTE: You can also play with various training settings to determine which ones generate better results for your specific use case.