For a better understanding of what Saga is and what the purpose of the UI is, please check out this presentation:
The UI is pretty simple. It has a main tab selector at the top where you choose the area you want to work in (Tags, Pipelines, Datasets, or Background Processes).
Users can define semantic tags to be used, and the recognizers and settings that each tag will use.
The Evaluate tab is used when you need to test several tags at once. This can be a quick test using the Preview functionality, or a full Test Run. Test runs started in this screen are called 'Evaluations' because they include statistics that can be used to compare two evaluation runs that use different settings.
For each evaluation you can open the Search Interface or open detailed statistics:
Users can add, delete, or update pipelines. A pipeline defines which stages run before the recognizer stages are added. For example, a pipeline could include stages such as a whitespace tokenizer, a text case analyzer, or a stop-words identifier.
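To make the role of these stages concrete, here is a small Python sketch of the kind of pre-processing a base pipeline performs. This is an illustration only, not Saga's implementation; the stop-word list is hypothetical, and Saga's real stages are configured in the UI.

```python
# Illustrative sketch only: mimics the pre-processing a base pipeline might do
# (whitespace tokenization, case analysis, stop-word flagging) before
# recognizers run. Not Saga's actual code.

STOP_WORDS = {"the", "a", "an", "of", "to", "due"}  # hypothetical stop list

def run_base_pipeline(text):
    tokens = []
    for raw in text.split():                           # whitespace tokenizer
        tokens.append({
            "text": raw,
            "lower": raw.lower(),                      # text case analyzer
            "is_stop": raw.lower() in STOP_WORDS,      # stop words identifier
        })
    return tokens

tokens = run_base_pipeline("The engine caught fire due to a bird strike")
print([t["lower"] for t in tokens if not t["is_stop"]])
# ['engine', 'caught', 'fire', 'bird', 'strike']
```

Downstream recognizers then work over these enriched tokens instead of raw text, which is why all recognizers of a tag should share the same base pipeline.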
Users can view the datasets loaded into the application to perform test runs and/or train machine learning models, define which fields to process from the dataset file, and choose how to split the text in order to feed the pipeline. (Since users cannot upload datasets at the moment, datasets need to be placed in a special folder in the Saga Server file system.)
Users can monitor background processes that are running. For example, when running a test run against a dataset, the process could take a long time to complete, so the user can check the progress in this screen.
Users can review results from a Test Run. The flow will be something like:
In this section, we'll go through the process of creating a set of tags, adding some recognizers to them and testing how they perform against a dataset. This will give the user a better idea of the process/flow when using the Saga UI.
We currently have a dataset loaded into Saga about aviation incidents. We will try to identify incidents where an engine catches fire due to a bird collision.
An important consideration is which stages you want to include in the base pipeline used by recognizers. The base pipeline usually has some stages to pre-process text before passing it to the recognizers.
As you can see in the following image, we'll use the baseline-pipeline that has the following stages:
We want to identify three things:
If those three things are present in an incident report, then we could hypothesize that the incident is about engines catching fire due to bird collisions.
1. Create the {bird} tag.
2. Add 'SimpleRegex' and 'Entity' recognizers to the bird tag:
3. Add the following patterns to the entity recognizer. (See the image below for an example of how to do it.)
4. Repeat the steps for each of the following patterns:
5. Add the following regex in the SimpleRegex recognizer. Note: The steps are very similar to how entities were added in the Entity recognizer.
6. Now, do the same for the fire and engine tags:
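The kind of matching a SimpleRegex recognizer performs can be sketched in plain Python. The pattern below is a hypothetical example in the spirit of the {bird} tag, not the exact pattern from the screenshots:

```python
import re

# Hypothetical pattern for a {bird}-style tag; the real patterns are
# entered in the Saga UI and may differ.
BIRD_PATTERN = re.compile(r"\b(bird|goose|geese|gull|seagull)s?\b", re.IGNORECASE)

def tag_bird(text):
    """Return the text spans a SimpleRegex-style recognizer would tag."""
    return [m.group(0) for m in BIRD_PATTERN.finditer(text)]

print(tag_bird("A flock of geese struck the left engine; one bird was ingested."))
# ['geese', 'bird']
```

The Entity recognizer works analogously, but matches a fixed list of entity strings instead of a regular expression.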
The idea here is to create a tag that will use {fire}, {engine} and {bird} to identify a concept: that the engine caught fire due to bird collisions. For this special tag, we'll use the Fragmented recognizer. This is an advanced recognizer that tags text containing the other three tags in any order of appearance, as long as they are close enough to one another within the aviation report.
1. Create a new tag called {fire-by-bird}. Use similar steps to create the other tags.
2. Attach the Fragmented recognizer and add the following pattern: {fire} {engine} {bird}
3. Make sure to select the In Any Order check box, with Max tokens at 16 and Min tokens at 4.
4. Make sure all of the recognizers of all your tags are using the same pipeline (or the pipeline you need them to use).
5. Click on the gear icon in each one of the recognizers to open its settings and check the 'Base Pipeline' field.
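The windowed, order-independent matching that the Fragmented recognizer performs can be sketched as follows. This is a conceptual illustration under the Max/Min token settings described above, not Saga's implementation:

```python
from itertools import product

# Conceptual sketch (not Saga's code): a Fragmented-style match succeeds when
# all required tags co-occur, in any order, within a token window whose size
# is between min_tokens and max_tokens.

def fragmented_match(tagged_tokens, required, min_tokens=4, max_tokens=16):
    """tagged_tokens: list of (token, tag-or-None) pairs."""
    positions = {tag: [i for i, (_, t) in enumerate(tagged_tokens) if t == tag]
                 for tag in required}
    if any(not p for p in positions.values()):
        return False                     # some required tag never occurs
    # try every combination of one occurrence per tag (fine for small inputs)
    for combo in product(*positions.values()):
        span = max(combo) - min(combo) + 1
        if min_tokens <= span <= max_tokens:
            return True
    return False

tokens = [("engine", "engine"), ("caught", None), ("fire", "fire"),
          ("after", None), ("a", None), ("bird", "bird"), ("strike", None)]
print(fragmented_match(tokens, ["fire", "engine", "bird"]))  # True
```

Lowering Max tokens makes the recognizer stricter (the tags must appear closer together), while Min tokens filters out degenerate matches where the tags are adjacent.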
Test any of your tags using the preview functionality. For example, let's test the {fire-by-bird} tag.
A dialog with the Saga graph is displayed. Note how the {bird}, {fire}, {engine} and also the {fire-by-bird} tags are identified in the text.
Once you have tested the performance of your tags using the preview, it is a good idea to test them against a larger body of text.
At the moment, Saga comes with several testing datasets. However, you can also create your own and upload them to a special folder in the Saga file system.
1. Go to the {fire-by-bird} tag, select Start Test Run, and then select the "--- New Test Run ---" option.
2. Select the Background Processes tab to review the progress of the run.
3. Wait for the test run to complete, or check partial results while it is running.
4. Select Open search to open the search interface. In this screen, you will find your tags as facets; when a facet is selected, you'll see search results containing your tags. In the following image, we are clearing facets and then selecting only {fire-by-bird} to check the comments that talk about engines catching fire due to bird collisions.
5. After reviewing the results, you can continue iterating: review the results, then tweak your tags and pipelines to create the best model for your specific use case.
The Name Entity recognizer uses Apache OpenNLP to tag text using an existing (previously trained) model. In addition, the recognizer can be used with other recognizers to train a new model.
In order to use it as a recognizer: Add the recognizer to your tag, choose a model, select the probability threshold used to decide if something is a match or not, and add normalization tags in case you want to cleanse and normalize the input.
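The role of the probability threshold can be sketched in a few lines of Python. The candidate spans and probabilities below are made up for illustration; the OpenNLP and Saga specifics are omitted:

```python
# Toy sketch of how a probability threshold decides whether a model's
# candidate span becomes a tag. The candidates here are invented; in Saga
# they would come from the trained OpenNLP model.

def apply_threshold(candidates, threshold=0.7):
    """candidates: list of (span_text, probability) pairs from a model."""
    return [text for text, prob in candidates if prob >= threshold]

candidates = [("Joseph", 0.93), ("Home", 0.41), ("Paul", 0.88)]
print(apply_threshold(candidates))  # ['Joseph', 'Paul']
```

Raising the threshold trades recall for precision: fewer spans are tagged, but the ones that remain are higher-confidence matches.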
Let's use the {human} tag to test this functionality:
1. Create the {human} tag.
2. Attach the Name Entity recognizer to the {human} tag.
3. Choose the default model "en-ner-person.bin". This model has been trained to identify English names of people.
4. Enter the following text in the preview in order to check out the Saga graph: "Several employees work from home, Joseph is one of them, Paul too". As you can see in the following image, the recognizer tags 'Joseph' and 'Paul' as {human}:
In order to train a model, use another recognizer as the base, along with a dataset that has a good sample of the values you want to identify. In this case, we'll use the 'CFR-2018' dataset, which contains government regulations.
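The training idea described above can be sketched as a weak-supervision loop: a rule-based recognizer labels the raw dataset, and those labels become the training examples for the statistical model. The pattern and sentences below are invented for illustration and are not Saga's internals:

```python
import re

# Sketch of using a base recognizer to auto-label training data.
# The pattern is a made-up stand-in for an Entity recognizer's entries.
PATTERN = re.compile(r"\b(incinerator|scrubber|catalytic converter)\b", re.I)

def auto_label(sentences):
    """Use the pattern recognizer as the 'base' to produce training data."""
    training = []
    for s in sentences:
        spans = [(m.start(), m.end()) for m in PATTERN.finditer(s)]
        training.append({"text": s, "entity_spans": spans})
    return training

corpus = ["The incinerator must meet emission limits.",
          "Records shall be kept on site."]
data = auto_label(corpus)
print(sum(1 for d in data if d["entity_spans"]))  # count of labeled sentences
```

In Saga, this labeled data would then be fed to OpenNLP to train the model; the quality of the base recognizer's patterns directly bounds the quality of the trained model.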
1. Create a new tag called {emissions-equipment}. Attach the Entity recognizer and add the following patterns:
2. Attach the Name Entity recognizer to your tag and click Train.
3. Select the 'CFR-2018-title40' dataset and then Execute.
4. Check out the Background Processes tab to review the progress of the training.
5. Once the process completes, use the preview to test the {emissions-equipment} tag. If you do see text tagged, then the model you trained is working.
The Classifier recognizer is used to perform binary classification of sentences. It also uses Apache OpenNLP internally and can be used as a recognizer and as a trainer (just like the Name Entity recognizer).
The difference between the Classifier recognizer and the Name Entity recognizer is that the Name Entity recognizer is used to identify entities, so it tags individual words. The Classifier tags an entire sentence, and it may use other algorithms not available in the Name Entity recognizer.
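The word-level vs. sentence-level distinction can be illustrated with a toy example. The gazetteer and keyword lists below are invented for the sketch; neither is Saga's or OpenNLP's actual logic:

```python
# Illustration (not Saga's code) of the difference: an entity recognizer
# returns tagged words, while a classifier assigns one label per sentence.

KNOWN_NAMES = {"joseph", "paul"}          # toy entity gazetteer
BIRD_WORDS = {"bird", "goose", "gull"}    # toy classifier "features"

def entity_tag(sentence):
    """Word-level: return the individual words tagged as entities."""
    return [w for w in sentence.split() if w.strip(".,").lower() in KNOWN_NAMES]

def classify(sentence):
    """Sentence-level: one label for the entire sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    return "bird-incident" if words & BIRD_WORDS else "other"

s = "A bird hit the engine, Paul reported."
print(entity_tag(s))   # ['Paul']
print(classify(s))     # 'bird-incident'
```

In the real recognizers, both decisions are made by trained OpenNLP models rather than fixed word lists, but the granularity difference is the same.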
You can then test its performance by running a test run.
3.2.2.2 Using it as a trainer
Training is the same as for the Name Entity recognizer: we need another recognizer to use as a base, and a dataset with a good quantity of samples of the text we want to classify.
The following steps describe how to train a model. We'll use the 'Aviation' dataset and try to tag sentences that talk about incidents with birds.
1. Use the {bird} tag that we created previously, and create a new one called {hit} with the following patterns:
2. Create a new tag called {bird-incident}.
3. Attach the Fragmented recognizer and add the following pattern:
4. Add the Classification recognizer to the {bird-incident} tag, and select '--NONE--' in the 'Model' field.
NOTE: Always remember to set this field to --NONE-- when training.
5. Select Train. When the dialog opens:
This will train a model on the Aviation dataset, using the pattern in the Fragmented recognizer.
6. Check out the 'Background Processes' tab and wait for the Classification training to complete.
7. Once done, go back to the {bird-incident} tag and disable the Fragmented recognizer.
8. In the Classification recognizer, select your latest created model in the 'Model' field. It should be named like: bird-incident-[datetime stamp here].bin. For example: 'bird-incident-20190208173305.bin'.
You can also use the option '--LATEST--' to always use the latest trained model.
9. Start a test run using the Aviation dataset.
10. Check out the Background Processes tab for completion.
11. When complete, select Open Search to check results in the Search screen.
As you can see, the Classification recognizer is tagging some sentences that, in theory, are supposed to be related to incidents with birds. In this case, because the dataset is small and the Fragmented recognizer identified relatively few positive samples, the Classification recognizer is not doing a very good job of identifying the sentences. With more and better training data, the accuracy of the Classification recognizer is expected to improve.
NOTE: You can also play with various training settings to determine which ones generate better results for your specific use case.