...
...
...
Saga is a comprehensive, easy-to-use middleware for maintainable and scalable Natural Language Understanding. It has automated pipeline construction, state-of-the-art handling of language ambiguity, integrated machine learning and business-friendly user interfaces for creating and maintaining language models at a reasonable cost. In addition to its out-of-the-box algorithms, Saga allows the usage of custom Python models. It is used by our customers to implement solutions like text/entity extraction, semantic search, text classification, extraction of knowledge graph relationships, question/answering, analytics on unstructured content, etc.
Tip: For a better understanding of what Saga is and what the purpose of the UI is, please check out this presentation in Teams.
The UI is pretty simple. It has a main tab selector at the top where you can determine where you want to work (Tags, Evaluate, Pipelines, Datasets, Rules/Executors or Background Processes).
A Login page is shown when security is enabled in the config file. It uses basic authentication against the user name and password defined in the configuration.
Users can define semantic tags to be used and the recognizers and settings that each tag will use.
But, what is a Semantic Tag anyway?
Semantic Tags are the organizing structure in Saga: they identify and interpret regions of text. They are basically anything you want to identify in text, and you can name them whatever you want as long as it makes sense to you. For example, let's say I want to identify emails in a document I'm processing with Saga; I can then create a tag named "email", or maybe "contact-email" or "eMail". This tag name will be used by Saga in the results to show where emails exist in the text.
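To make the idea concrete, here is a minimal sketch, in plain JavaScript rather than Saga itself, of what an "email" tag conceptually does: scan a text and report the regions it identifies, labeled with the tag name. The regex here is a simplified assumption, not the pattern Saga would use.

```javascript
// Illustration only: a toy "email" tag. In Saga, recognizers are
// configured in the UI; this just shows the concept of tagging regions.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;

function tagEmails(text) {
  const matches = [];
  let m;
  // The global flag lets exec() walk through every occurrence.
  while ((m = EMAIL.exec(text)) !== null) {
    matches.push({
      tag: 'email',
      value: m[0],
      start: m.index,
      end: m.index + m[0].length,
    });
  }
  return matches;
}

console.log(tagEmails('Contact us at info@example.com or sales@example.org'));
```

Each match carries the tag name plus the region of text it covers, which is essentially the information Saga shows in its results graph.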
Evaluate tab is used when the user needs to test several tags at once. It could be a quick test using the Preview functionality or starting a Test Run against a dataset. Test runs started in this screen are called 'Evaluations' because they include some statistics that can be used to compare 2 Evaluation runs that use different settings.
For each evaluation you can open the Search Interface or open detailed statistics:
Tip: In case you want to delete an Evaluation, it can be done in the 'Background Processes' tab.
Users can add, update, or delete pipelines. A pipeline defines which stages are included before the recognizer stages are appended. For example, a pipeline could have stages such as a whitespace tokenizer, a text case analyzer, or a stop words identifier.
The user has the ability to move stages around using up and down buttons:
Also stages can be inserted at the desired position using the context menu:
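The stage idea can be pictured as a chain of functions, each transforming the output of the previous one. The sketch below is a toy illustration in plain JavaScript (not Saga's actual stage API): a whitespace tokenizer, a case analyzer, and a stop word identifier, with flag names chosen to echo the ones mentioned elsewhere in this document.

```javascript
// Toy pipeline: each stage takes the previous stage's output and
// returns the transformed result.
const STOP_WORDS = new Set(['the', 'a', 'on', 'of']);

// Split raw text into tokens, each starting with an empty flag list.
const whitespaceTokenizer = (text) =>
  text.split(/\s+/).filter(Boolean).map((t) => ({ text: t, flags: [] }));

// Flag tokens that are entirely upper or lower case.
const caseAnalyzer = (tokens) =>
  tokens.map((t) => ({
    ...t,
    flags: t.flags.concat(
      t.text === t.text.toUpperCase() ? ['ALL_UPPER_CASE']
      : t.text === t.text.toLowerCase() ? ['ALL_LOWER_CASE']
      : []
    ),
  }));

// Flag tokens that appear in the stop word list.
const stopWordIdentifier = (tokens) =>
  tokens.map((t) =>
    STOP_WORDS.has(t.text.toLowerCase())
      ? { ...t, flags: t.flags.concat('STOP_WORD') }
      : t
  );

// Run the stages in order, feeding each one the previous result.
function runPipeline(text, stages) {
  return stages.reduce((data, stage) => stage(data), text);
}

const result = runPipeline('FIRE on the engine', [
  whitespaceTokenizer,
  caseAnalyzer,
  stopWordIdentifier,
]);
console.log(result);
```

The ordering matters, which is why the UI lets you move stages up and down: a stage can only act on flags and tokens produced by the stages before it.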
Users can view the datasets loaded into the application to perform test runs and/or training of machine learning models, and can define which fields to process from the dataset file and how to split the text in order to feed the pipeline. (Since users cannot upload datasets at the moment, datasets need to be placed in a special folder on the Saga Server file system.)
Warning: As of 1.3.3 this functionality is deprecated until integration with SearchAPI is done.
Saga can be the engine behind semantic search, as it is for the ESUI (Enterprise Search User Interface). The user can create rules and define what rules will be executed for each tag that Saga identifies in a query, through the configuration done in this tab.
Note: Saga provides integration with a custom ESUI.
Users can monitor background processes that are running. For example, when running a test run against a dataset, the process could take a long time to complete, so the user can check the progress in this screen.
Users can export and import all the data in Saga to a .sg folder, either for backup or for importing into another Saga instance.
The environment tool creates and downloads a file with the current environment conditions, such as RAM, CPU, and hard drive space.
Here is an example of the data the file will contain:
{
"add-ons": {
"processors": [
"GoogleKnowledgeStage:1.3.1",
"NamePredictorStage:1.3.1",
"FaqStage:1.3.1",
"ClassificationStage:1.3.1",
"GoogleEntityPredictorStage:1.3.1"
],
"recognizers": [
"GoogleKnowledgeStage:1.3.1",
"NamePredictorStage:1.3.1",
"FaqStage:1.3.1",
"ClassificationStage:1.3.1",
"GoogleEntityPredictorStage:1.3.1"
]
},
"java": "11.0.2",
"os": "Windows 10",
"elasticsearch": "7.4.2",
"cpu": {
"usage (%)": {
"jvm": 7.678,
"system": 0.0
},
"processors": 12,
"arch": "amd64"
},
"memory (Mb)": {
"jvm": {
"committed": 8377,
"using": 4096,
"max": 10240
},
"system": {
"total": 32503,
"using": 15610,
"free": 16893
},
"swap": {
"total": 42743,
"using": 27392,
"free": 15351
}
},
"config": "C:\\Saga\\il Master\\saga-server\\config\\config.json",
"version": "1.3.1",
"version-date": "2020-03-05T10:14:59.561-06:00"
}
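As a sketch of how such a report might be consumed programmatically, the snippet below derives a quick health summary from the sample above. The field names follow the sample JSON; the 90% warning threshold is an arbitrary assumption, not a Saga setting.

```javascript
// Summarize an environment report like the one produced by the
// environment tool. Field names mirror the sample JSON above.
function summarizeEnvironment(report) {
  const sys = report['memory (Mb)'].system;
  const usedPct = (sys.using / sys.total) * 100;
  return {
    version: report.version,
    os: report.os,
    systemMemoryUsedPct: Math.round(usedPct * 10) / 10, // one decimal place
    lowMemory: usedPct > 90, // assumption: warn above 90% usage
  };
}

// A trimmed-down version of the sample report above.
const sample = {
  version: '1.3.1',
  os: 'Windows 10',
  'memory (Mb)': { system: { total: 32503, using: 15610, free: 16893 } },
};
console.log(summarizeEnvironment(sample));
```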
The GPT3 Proxy tool allows the use of OpenAI libraries to create text and to search in large documents. This page has more information about this tool.
The DXF Playground helps the user create UI elements which can be rendered in different parts of SAGA. This page has more information about this tool.
Users can review results from a Test Run. The flow will be something like:
If the user spots an entity in the text that is not being recognized by Saga, it can be added to the tag's entity recognizer dictionary by selecting the text and clicking the add button, like this:
Note: The add-to-dictionary functionality is only available for test runs that use one tag, so it won't appear for Evaluation runs.
In this section, we'll go through the process of creating a set of tags, adding some recognizers to them and testing how they perform against a dataset. This will give the user a better idea of the process/flow when using the Saga UI.
We currently have a dataset loaded into Saga about aviation incidents. We will try to identify incidents where an engine catches fire due to a bird collision.
An important consideration is which stages you want to include in the base pipeline used by recognizers. Pipelines usually have a set of stages to pre-process text before passing it to the recognizers.
As you can see in the following image, we'll use the baseline-pipeline that has the following stages:
We want to identify three things:
If those three things are present in an incident report, then we could hypothesize that the incident is about engines catching fire due to bird collisions.
1. Let's start by creating the {bird} tag.
Each tag provides the following functionality:
Add Child: Allows the user to add a child tag.
Rename: Allows the user to change the name of the tag in case it is misspelled.
Delete: Allows the user to delete a tag. Just be careful when deleting parent tags, because the delete functionality performs a cascading delete of all child tags.
Find Tag Usage: Finds whether the selected tag is used by other tags (for example, when the tag is used in an Advanced Recognizer).
Export: Allows you to export the tag as a .sg file.
Cut: Allows you to cut the tag and paste it under another tag.
2. Add 'SimpleRegex' and 'Entity' recognizers to the bird tag:
3. Add the following patterns to the entity recognizer: duck, ducks
4. Repeat the steps for each of the following patterns (you can also add their plural form as additional patterns):
Info: Check the settings for the Entity Recognizer; you may need to remove the ALL_LOWER_CASE flag from the 'Required flags' setting in order for the matching to work:
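The 'Required flags' behavior can be pictured as a simple gate: a recognizer only considers tokens whose flags include everything it requires. The sketch below is a simplified assumption of that idea, not Saga's implementation, and it shows why an ALL_LOWER_CASE requirement would skip upper-case report text like "DUCK".

```javascript
// Toy required-flags gate: a token is eligible only if it carries all
// of the required flags.
const hasRequiredFlags = (token, required) =>
  required.every((f) => token.flags.includes(f));

// A token as it might look after a case-analysis stage flagged it.
const token = { text: 'DUCK', flags: ['TOKEN', 'ALL_UPPER_CASE'] };

console.log(hasRequiredFlags(token, ['TOKEN']));                    // eligible
console.log(hasRequiredFlags(token, ['TOKEN', 'ALL_LOWER_CASE'])); // skipped
```

Removing ALL_LOWER_CASE from the required list corresponds to passing a smaller `required` array here, so upper-case tokens are no longer filtered out.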
5. Add the Simple Regex recognizer the same way you added the Entity Recognizer, and then add the following regex:
Info: The steps to add SimpleRegex patterns are very similar to how entities were added in the Entity recognizer. We are using a regex for this one just for demonstration purposes.
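The exact regex used in this example is not shown in the text, but a plausible pattern (an assumption, chosen to cover a bird term and its plural in one expression) would look something like this:

```javascript
// Hypothetical SimpleRegex pattern for the {bird} tag: matches
// "seagull" or "seagulls" as a whole word, case-insensitively.
const birdPattern = /\bseagulls?\b/i;

console.log(birdPattern.test('SEAGULL STRIKE INTO TURBINE')); // matches
console.log(birdPattern.test('the gull flew away'));          // no match
```

A regex like this is interchangeable with listing `seagull` and `seagulls` as Entity patterns, which is why the text notes the regex is used here only for demonstration.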
6. Now, do the same for fire and engine tags:
The idea here is to create a tag that will use {fire}
, {engine}
and {bird}
to identify a concept: that the engine caught fire due to a bird collision. For this special tag, we'll use the Fragmented recognizer. This is an advanced recognizer that tags text that contains the other three tags, in any order of appearance, and close enough to one another within the aviation report.
1. Create a tag called 'fire-by-bird'. Use similar steps to create the other tags.
2. Attach the Fragmented recognizer to the tag.
3. Add the following pattern: {fire} {engine} {bird}. Make sure to select the In Any Order check box, and set Max tokens to 10 and Min tokens to 4.
4. For our example, make sure to remove the 'Sentence Breaker' stage from the 'Base Pipeline' in case it is present.
The 'Sentence Breaker' stage processes a text block and splits it into sentences using the configured language. It uses punctuation and language specifics to achieve this, but you can also specify a list of additional 'breaker' characters.
For our example, we don't need the aviation incident to be split into sentences because each incident is a small portion of text. In case it has several sentences, they are all related to the same incident.
5. In Saga, by default the 'Whitespace Tokenizer' expects text blocks containing the 'SENTENCE' flag, which is previously set by the 'Sentence Breaker' stage. Because we removed the 'Sentence Breaker' stage, we need to remove the 'SENTENCE' flag from the required list of flags in the 'Whitespace Tokenizer' stage configuration:
6. Make sure all of the recognizers of all your tags are using the same pipeline (or the pipeline you need them to use).
Click on the gear icon in each one of the recognizers to open its settings and check the field 'Base Pipeline'.
7. You can also import patterns from different data sources (For example JSON, JSONL, XML, XLSX and CSV).
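The matching idea behind the Fragmented recognizer, all required tags present in any order within a bounded token window, can be sketched as follows. This is an illustrative assumption written in plain JavaScript, not Saga's implementation, and the token/tag shape is invented for the example.

```javascript
// Sketch of the Fragmented recognizer idea: report a match when all
// required tags occur, in any order, within a window of maxTokens tokens.
function fragmentedMatch(tokens, requiredTags, maxTokens) {
  for (let start = 0; start < tokens.length; start++) {
    const seen = new Set();
    for (let i = start; i < tokens.length && i - start < maxTokens; i++) {
      if (requiredTags.includes(tokens[i].tag)) seen.add(tokens[i].tag);
      if (seen.size === requiredTags.length) return { start, end: i };
    }
  }
  return null; // required tags never co-occur closely enough
}

// Tokens as they might look after the base pipeline and the three
// simple recognizers have run over an incident report.
const tokens = [
  { text: 'SEAGULL', tag: 'bird' },
  { text: 'STRIKE', tag: null },
  { text: 'INTO', tag: null },
  { text: 'TURBINE', tag: 'engine' },
  { text: 'FLAME', tag: 'fire' },
];

console.log(fragmentedMatch(tokens, ['fire', 'engine', 'bird'], 10));
```

Shrinking `maxTokens` makes the recognizer stricter: with a window of 3 the same report no longer matches, which is the knob the Max tokens setting controls.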
Test any of your tags using the preview functionality. For example, let's test the {fire-by-bird}
tag.
A dialog with the Saga graph is displayed. Note how the {bird}
, {fire}
, {engine}
and also the {fire-by-bird}
tags are identified in the text.
Once you have tested the performance of your tags using the preview, it might be a good idea to test them against a larger body of text.
You can create your own datasets and upload them to a special folder in the Saga file system (learn how to do it in the Datasets article).
1. Inside the {fire-by-bird} tag, select Start Test Run and then select the "--- New Test Run ---" option.
2. Select the Aviation-Incidents dataset and click Execute.
3. Select the Background Processes tab to review the progress of the run.
4. Wait for the test run to complete, or click the "Open search" button to see partial results while it is running.
5. When the process is complete, click Open search to open the search interface. In this screen, you will find your tags as facets. When selected, you'll see search results containing your tags.
In the following image, we are clearing facets and then selecting only {fire-by-bird}
to check the comments that talk about engines catching fire due to bird collisions.
6. After reviewing the results, you can continue iterating on a process of reviewing results and tweaking your tags and pipelines to create the best model to use for your specific use case.
The name entity recognizer uses Apache OpenNLP to tag text using an existing model (previously trained). In addition, the recognizer can be used with other recognizers to train a new model.
(If you need the model, go to OpenNLP Models and look for en-ner-person.bin.)
In order to use it as a recognizer: Add the recognizer to your tag, choose a model, select the probability threshold used to decide if something is a match or not, and add normalization tags in case you want to cleanse and normalize the input.
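The probability threshold can be pictured as a simple filter over the model's candidate matches. The sketch below is an illustration under the assumption that the model returns candidate spans with probabilities; it is not OpenNLP's actual API, and the candidate values are invented.

```javascript
// Toy thresholding of model predictions: keep only candidates whose
// probability meets the configured minimum.
function applyThreshold(candidates, minProbability) {
  return candidates.filter((c) => c.probability >= minProbability);
}

// Hypothetical candidates a name model might emit for a sentence.
const candidates = [
  { text: 'Joseph', tag: 'human', probability: 0.91 },
  { text: 'Paul', tag: 'human', probability: 0.78 },
  { text: 'Acme', tag: 'human', probability: 0.31 },
];

console.log(applyThreshold(candidates, 0.5).map((c) => c.text));
```

Raising the threshold trades recall for precision: low-confidence candidates like 'Acme' are dropped, at the risk of also dropping genuine matches.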
Let's use the {human} tag to test this functionality:
1. Click the {human} tag.
2. Click the Entity recognizer, and then click the gear button to open its settings.
3. Disable the Entity recognizer.
4. Attach the Name Entity recognizer to the {human} tag.
5. Choose one of the default models: "en-ner-person.bin". This model was trained to identify English names of people.
6. Enter the following text in the preview in order to check out the Saga graph: "Several employees work from home, Joseph and Paul are two of them". As you can see in the following image, the recognizer tags 'Joseph' and 'Paul' as {human}:
In order to train a model, use another recognizer as the base and a dataset that has a good sample of the desired values to identify. In this case, we'll use the 'CFR-2018' dataset which contains regulations from the government.
(You can get this dataset by going to Documents/General/Saga Datasets inside the Microsoft Teams space for SAGA&ESUI here.)
1. Create a new tag called {emissions-equipment}. Attach the Entity recognizer and add the following patterns:
- vehicle
- locomotive
- truck
- marine engine
- tanker truck
- engine
2. Attach the Name recognizer to your tag and click Train.
3. Select the 'CFR-2018-title40' dataset and then Execute.
4. Check out the Background Processes tab to review the progress of the training.
5. Once the process completes:
- Go to your tag and disable the Entity recognizer.
- In the Name recognizer, select your recently created model (the result of the training). It should be named something like "emissions-equipment-[date stamp here]", for example: emissions-equipment-20190206172621. You can also use the option --LATEST-- so it always uses the most recent model you created.
- Set the 'Minimum Probability' field to 0.5.
- Run a Test Run against the CFR dataset.
- Once the test run finishes, use the search interface to check if any text was tagged with the emissions-equipment tag. If you do see text tagged, then the model you trained is working.
The Classifier recognizer is used to perform binary classification of sentences. It uses Apache OpenNLP internally and can be used as a recognizer and as a trainer (just like the Name Entity recognizer).
The difference between the Classifier recognizer and the Name Entity recognizer is that Name Entity is used to identify entities: it tags a word. The Classifier tags an entire sentence and may use other algorithms not available in the Name Entity recognizer.
1. Attach the recognizer to your tag and select a model from the list.
2. Then you can test its performance by running a test run.
3.2.2.2 Using it as a trainer
Training is the same as for the Name Entity recognizer. We need another recognizer to use as a base, and a dataset with a good quantity of samples of the text we want to classify.
The following steps describe how to do a training. We'll use the 'Aviation' dataset and will try to tag sentences that talk about incidents with birds.
1. Let's reuse the {bird} tag that we created previously, and create a new one called {hit} with the following patterns:
2. Create a new tag called {bird-incident}.
3. Attach the Fragmented recognizer and add the following pattern:
4. Add the Classification recognizer to the {bird-incident}
tag, and select '--NONE--' in the 'Model' field.
Tip: Always remember to set this field to --NONE-- when training.
5. Select Train. When the dialog opens:
- Select the Aviation dataset.
- Select 'N-Gram' in the 'Feature Selection' field.
- Increase the 'max n-gram' field to 3.
- Click Execute.
This will train a model using the Aviation dataset, with the pattern in the Fragmented recognizer identifying the positive samples.
6. Check out the 'Background Processes' tab and wait for the Classification training to complete.
7. Once done, go back to the {bird-incident}
tag and disable the Fragmented recognizer.
8. In the Classification recognizer, select your latest created model in the 'Model' field. It should be named like: bird-incident-[datetime stamp here].bin. For example: 'bird-incident-20190208173305.bin'.
Tip: You can also use the option '--LATEST--' to always use the latest trained model.
9. Start a test run using the Aviation dataset.
10. Check out the Background Processes tab for completion.
11. When complete, select Open Search to check results in the Search screen.
As you can see, the Classification recognizer is tagging some sentences that, in theory, are supposed to be related to incidents with birds. In this case, because the dataset is small and the Fragmented recognizer identified only a few positive samples, the Classification recognizer is not doing a very good job of identifying the sentences. With more and better training data, the accuracy of the Classification recognizer is expected to improve.
Tip: You can also play with the various training settings to determine which ones generate better results for your specific use case.
This section describes Saga recognizers that were not used in any of the examples explained in previous sections.
Warning: This recognizer is part of the functionality that Saga has in place to implement semantic search.
Best Bets, featured in most search interfaces, highlight important information when certain keywords are detected in the user's query. Best Bets hits do not come from the search engine itself but from a curated list of hits.
So let's say that when someone searches for "how to handle fire in a plane", Best Bets detects the keywords 'fire' and 'plane' and therefore shows, before the search results, a link to the company's 'How to handle fire' manual. This manual is the official and recommended source of information for those specific cases.
The idea behind the Best Bets recognizer is therefore to keep the curated list of best bets in Saga and then provide that information to ESUI when a query has been tagged with a best bet tag.
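Conceptually, the curated list behaves like a keyword-to-hit lookup that is consulted independently of the search engine. The sketch below is an illustration only; the record fields echo the executor example later in this section, and the URL is a placeholder assumption.

```javascript
// Sketch of the Best Bets idea: a curated keyword -> hit mapping.
const bestBets = [
  {
    keywords: ['fire', 'plane'],
    title: 'How to handle fire',            // curated hit, not a search result
    url: 'https://example.com/fire-manual', // placeholder URL (assumption)
  },
];

// Return every curated hit whose keywords appear in the query terms.
function findBestBets(query) {
  const terms = query.toLowerCase().split(/\s+/);
  return bestBets.filter((b) => b.keywords.some((k) => terms.includes(k)));
}

console.log(findBestBets('how to handle fire in a plane').map((b) => b.title));
```

In Saga, the keyword detection side of this is handled by tagging the query, and the rule/executor configured below turns the matched record into the hit ESUI displays.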
1. Create a tag called {fire-manual}.
2. Attach the Best Bets recognizer to the tag.
3. Add a new best bet pattern. This record will contain the following information:
4. Now that the tag and recognizer are correctly set we need to create a rule and executor that ESUI will run to properly show the best bet in the search results page.
4.1 Go to Rules/Executors tab
4.2 Click on the Executors sub tab
4.3 Add the following code in the "Process" code section.
Please notice:
const result = {
title: data.title,
description: data.description,
url: data.url,
};
return _saga.response(result, 'bestBets')
Your screen should look like this:
4.4 Click on the 'Rules' sub tab and add a new rule that uses the executor we created in the previous step:
4.5 Add the {fire-manual}
tag in the ESUI saga endpoint configuration ('tags' property)
4.6 When querying something with the word 'fire', which matches our best bet recognizer, you should see a best bet hit in the search results in ESUI:
...