For a better understanding of what Saga is and what the purpose of the UI is, please check out this presentation:
The UI is pretty simple. It has a main tab selector at the top where you choose the area you want to work in (Tags, Pipelines, Datasets, or Background Processes).
Users can define semantic tags to be used, and the recognizers and settings that each tag will use.
The Evaluate tab is used when you need to test several tags at once. This can be a quick test using the Preview functionality, or a full Test Run. Test runs started in this screen are called 'Evaluations' because they include statistics that can be used to compare two evaluation runs that use different settings.
For each evaluation you can open the Search Interface or open detailed statistics:
Users can add, delete, or update pipelines. A pipeline defines which stages run before the recognizer stages are added. For example, a pipeline could include stages such as a whitespace tokenizer, a text case analyzer, or a stop-words identifier.
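To make the role of these stages concrete, here is a small Python sketch of the kind of pre-processing a base pipeline performs. This is an illustration only, not Saga's implementation; the stop-word list is hypothetical, and Saga's real stages are configured in the UI.

```python
# Illustrative sketch only: mimics the pre-processing a base pipeline might do
# (whitespace tokenization, case analysis, stop-word flagging) before
# recognizers run. Not Saga's actual code.

STOP_WORDS = {"the", "a", "an", "of", "to", "due"}  # hypothetical stop list

def run_base_pipeline(text):
    tokens = []
    for raw in text.split():                           # whitespace tokenizer
        tokens.append({
            "text": raw,
            "lower": raw.lower(),                      # text case analyzer
            "is_stop": raw.lower() in STOP_WORDS,      # stop words identifier
        })
    return tokens

tokens = run_base_pipeline("The engine caught fire due to a bird strike")
print([t["lower"] for t in tokens if not t["is_stop"]])
# ['engine', 'caught', 'fire', 'bird', 'strike']
```

Downstream recognizers then work over these enriched tokens instead of raw text, which is why all recognizers of a tag should share the same base pipeline.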
Users can view the datasets loaded into the application to perform test runs and/or train machine learning models, define which fields to process from the dataset file, and choose how to split the text in order to feed the pipeline. (Since users cannot upload datasets at the moment, datasets need to be placed in a special folder in the Saga Server file system.)
Users can monitor background processes that are running. For example, when running a test run against a dataset, the process could take a long time to complete, so the user can check the progress in this screen.
Users can review results from a Test Run. The flow will be something like:
In this section, we'll go through the process of creating a set of tags, adding some recognizers to them and testing how they perform against a dataset. This will give the user a better idea of the process/flow when using the Saga UI.
We currently have a dataset loaded into Saga about aviation incidents. We will try to identify incidents where an engine catches fire due to a bird collision.
An important consideration is which stages you want to include in the base pipeline used by recognizers. The base pipeline usually has some stages to pre-process text before passing it to the recognizers.
As you can see in the following image, we'll use the baseline-pipeline that has the following stages:
We want to identify three things:
If those three things are present in an incident report, then we could hypothesize that the incident is about engines catching fire due to bird collisions.
1. Create the {bird} tag.
2. Add 'SimpleRegex' and 'Entity' recognizers to the bird tag:
3. Add the following patterns to the entity recognizer. (See the image below for an example of how to do it.)
4. Repeat the steps for each of the following patterns:
5. Add the following regex in the SimpleRegex recognizer. Note: The steps are very similar to how entities were added in the Entity recognizer.
6. Now, do the same for the fire and engine tags:
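The kind of matching a SimpleRegex recognizer performs can be sketched in plain Python. The pattern below is a hypothetical example in the spirit of the {bird} tag, not the exact pattern from the screenshots:

```python
import re

# Hypothetical pattern for a {bird}-style tag; the real patterns are
# entered in the Saga UI and may differ.
BIRD_PATTERN = re.compile(r"\b(bird|goose|geese|gull|seagull)s?\b", re.IGNORECASE)

def tag_bird(text):
    """Return the text spans a SimpleRegex-style recognizer would tag."""
    return [m.group(0) for m in BIRD_PATTERN.finditer(text)]

print(tag_bird("A flock of geese struck the left engine; one bird was ingested."))
# ['geese', 'bird']
```

The Entity recognizer works analogously, but matches a fixed list of entity strings instead of a regular expression.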
The idea here is to create a tag that will use {fire}, {engine} and {bird} to identify a concept: that the engine caught fire due to bird collisions. For this special tag, we'll use the Fragmented recognizer. This is an advanced recognizer that tags text containing the other three tags in any order of appearance, as long as they are close enough to one another within the aviation report.
1. Create a new tag called {fire-by-bird}. Use similar steps to create the other tags.
2. Attach the Fragmented recognizer and add the following pattern: {fire} {engine} {bird}
3. Make sure to select the In Any Order check box, with Max tokens at 16 and Min tokens at 4.
4. Make sure all of the recognizers of all your tags are using the same pipeline (or the pipeline you need them to use).
5. Click on the gear icon in each one of the recognizers to open its settings and check the 'Base Pipeline' field.
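The windowed, order-independent matching that the Fragmented recognizer performs can be sketched as follows. This is a conceptual illustration under the Max/Min token settings described above, not Saga's implementation:

```python
from itertools import product

# Conceptual sketch (not Saga's code): a Fragmented-style match succeeds when
# all required tags co-occur, in any order, within a token window whose size
# is between min_tokens and max_tokens.

def fragmented_match(tagged_tokens, required, min_tokens=4, max_tokens=16):
    """tagged_tokens: list of (token, tag-or-None) pairs."""
    positions = {tag: [i for i, (_, t) in enumerate(tagged_tokens) if t == tag]
                 for tag in required}
    if any(not p for p in positions.values()):
        return False                     # some required tag never occurs
    # try every combination of one occurrence per tag (fine for small inputs)
    for combo in product(*positions.values()):
        span = max(combo) - min(combo) + 1
        if min_tokens <= span <= max_tokens:
            return True
    return False

tokens = [("engine", "engine"), ("caught", None), ("fire", "fire"),
          ("after", None), ("a", None), ("bird", "bird"), ("strike", None)]
print(fragmented_match(tokens, ["fire", "engine", "bird"]))  # True
```

Lowering Max tokens makes the recognizer stricter (the tags must appear closer together), while Min tokens filters out degenerate matches where the tags are adjacent.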
Test any of your tags using the preview functionality. For example, let's test the {fire-by-bird} tag.
A dialog with the Saga graph is displayed. Note how the {bird}, {fire}, {engine} and also the {fire-by-bird} tags are identified in the text.
Once you have tested the performance of your tags using the preview, it is a good idea to test them against a larger body of text.
At the moment, Saga comes with several testing datasets. However, you can also create your own and upload them to a special folder in the Saga file system.
1. Go to the {fire-by-bird} tag, select Start Test Run, and then select the "--- New Test Run ---" option.
2. Select the Background Processes tab to review the progress of the run.
3. Wait for the test run to complete, or check partial results while it is running.
4. Select Open search to open the search interface. In this screen, you will find your tags as facets; when a facet is selected, you'll see search results containing your tags. In the following image, we are clearing facets and then selecting only {fire-by-bird} to check the comments that talk about engines catching fire due to bird collisions.
5. After reviewing the results, you can continue iterating: review the results, then tweak your tags and pipelines to create the best model for your specific use case.
The Name Entity recognizer uses Apache OpenNLP to tag text using an existing (previously trained) model. In addition, the recognizer can be used with other recognizers to train a new model.
In order to use it as a recognizer: Add the recognizer to your tag, choose a model, select the probability threshold used to decide if something is a match or not, and add normalization tags in case you want to cleanse and normalize the input.
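The role of the probability threshold can be sketched in a few lines of Python. The candidate spans and probabilities below are made up for illustration; the OpenNLP and Saga specifics are omitted:

```python
# Toy sketch of how a probability threshold decides whether a model's
# candidate span becomes a tag. The candidates here are invented; in Saga
# they would come from the trained OpenNLP model.

def apply_threshold(candidates, threshold=0.7):
    """candidates: list of (span_text, probability) pairs from a model."""
    return [text for text, prob in candidates if prob >= threshold]

candidates = [("Joseph", 0.93), ("Home", 0.41), ("Paul", 0.88)]
print(apply_threshold(candidates))  # ['Joseph', 'Paul']
```

Raising the threshold trades recall for precision: fewer spans are tagged, but the ones that remain are higher-confidence matches.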
Let's use the {human} tag to test this functionality:
1. Create the {human} tag.
2. Attach the Name Entity recognizer to the {human} tag.
3. Choose the default model "en-ner-person.bin". This model has been trained to identify English names of people.
4. Enter the following text in the preview in order to check out the Saga graph: "Several employees work from home, Joseph is one of them, Paul too". As you can see in the following image, the recognizer tags 'Joseph' and 'Paul' as {human}:
In order to train a model, use another recognizer as the base, along with a dataset that has a good sample of the values you want to identify. In this case, we'll use the 'CFR-2018' dataset, which contains government regulations.
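The training idea described above can be sketched as a weak-supervision loop: a rule-based recognizer labels the raw dataset, and those labels become the training examples for the statistical model. The pattern and sentences below are invented for illustration and are not Saga's internals:

```python
import re

# Sketch of using a base recognizer to auto-label training data.
# The pattern is a made-up stand-in for an Entity recognizer's entries.
PATTERN = re.compile(r"\b(incinerator|scrubber|catalytic converter)\b", re.I)

def auto_label(sentences):
    """Use the pattern recognizer as the 'base' to produce training data."""
    training = []
    for s in sentences:
        spans = [(m.start(), m.end()) for m in PATTERN.finditer(s)]
        training.append({"text": s, "entity_spans": spans})
    return training

corpus = ["The incinerator must meet emission limits.",
          "Records shall be kept on site."]
data = auto_label(corpus)
print(sum(1 for d in data if d["entity_spans"]))  # count of labeled sentences
```

In Saga, this labeled data would then be fed to OpenNLP to train the model; the quality of the base recognizer's patterns directly bounds the quality of the trained model.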
1. Create a new tag called {emissions-equipment}. Attach the Entity recognizer and add the following patterns:
2. Attach the Name Entity recognizer to your tag and click Train.
3. Select the 'CFR-2018-title40' dataset and then Execute.
4. Check out the Background Processes tab to review the progress of the training.
5. Once the process completes, use the preview to test the {emissions-equipment} tag. If you do see text tagged, then the model you trained is working.
The Classifier recognizer is used to perform binary classification of sentences. It also uses Apache OpenNLP internally and can be used as a recognizer and as a trainer (just like the Name Entity recognizer).
The difference between the Classifier recognizer and the Name Entity recognizer is that the Name Entity recognizer is used to identify entities, so it tags individual words. The Classifier tags an entire sentence, and it may use other algorithms not available in the Name Entity recognizer.
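The word-level vs. sentence-level distinction can be illustrated with a toy example. The gazetteer and keyword lists below are invented for the sketch; neither is Saga's or OpenNLP's actual logic:

```python
# Illustration (not Saga's code) of the difference: an entity recognizer
# returns tagged words, while a classifier assigns one label per sentence.

KNOWN_NAMES = {"joseph", "paul"}          # toy entity gazetteer
BIRD_WORDS = {"bird", "goose", "gull"}    # toy classifier "features"

def entity_tag(sentence):
    """Word-level: return the individual words tagged as entities."""
    return [w for w in sentence.split() if w.strip(".,").lower() in KNOWN_NAMES]

def classify(sentence):
    """Sentence-level: one label for the entire sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    return "bird-incident" if words & BIRD_WORDS else "other"

s = "A bird hit the engine, Paul reported."
print(entity_tag(s))   # ['Paul']
print(classify(s))     # 'bird-incident'
```

In the real recognizers, both decisions are made by trained OpenNLP models rather than fixed word lists, but the granularity difference is the same.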
You can then test its performance by running a test run.
3.2.2.2 Using it as a trainer
Training is the same as for the Name Entity recognizer: we need another recognizer to use as a base, and a dataset with a good quantity of samples of the text we want to classify.
The following steps describe how to train a model. We'll use the 'Aviation' dataset and try to tag sentences that talk about incidents with birds.
1. Use the {bird} tag that we created previously, and create a new one called {hit} with the following patterns:
2. Create a new tag called {bird-incident}.
3. Attach the Fragmented recognizer and add the following pattern:
4. Add the Classification recognizer to the {bird-incident} tag, and select '--NONE--' in the 'Model' field.
NOTE: Always remember to set this field to --NONE-- when training.
5. Select Train. When the dialog opens:
This will train a model on the Aviation dataset, using the pattern in the Fragmented recognizer.
6. Check out the 'Background Processes' tab and wait for the Classification training to complete.
7. Once done, go back to the {bird-incident} tag and disable the Fragmented recognizer.
8. In the Classification recognizer, select your latest created model in the 'Model' field. It should be named like: bird-incident-[datetime stamp here].bin. For example: 'bird-incident-20190208173305.bin'.
You can also use the option '--LATEST--' to always use the latest trained model.
9. Start a test run using the Aviation dataset.
10. Check out the Background Processes tab for completion.
11. When complete, select Open Search to check results in the Search screen.
As you can see, the Classification recognizer is tagging some sentences that, in theory, are supposed to be related to incidents with birds. In this case, because the dataset is small and the Fragmented recognizer identified relatively few positive samples, the Classification recognizer is not doing a very good job of identifying the sentences. With more and better training data, the accuracy of the Classification recognizer is expected to improve.
NOTE: You can also play with various training settings to determine which ones generate better results for your specific use case.