3.2.2 Classifier Recognizer

This recognizer performs binary classification of sentences. Like the Name Entity recognizer, it uses Apache OpenNLP internally and can be used both as a recognizer and as a trainer.

The difference between the two is that the Name Entity recognizer is used to identify entities, so it tags individual words. The classifier tags a whole sentence, and it can use other algorithms that are not available in the Name Entity recognizer.

3.2.2.1 Using it as a recognizer

To use it as a recognizer, attach the recognizer to your tag and then select a model from the list:


Image Added

You can then check its performance by starting a test run.


3.2.2.2 Using it as a trainer

Training works the same way as in the Name Entity recognizer. We need another recognizer to use as a base and a dataset with a good number of samples of the text we want to classify.

As an example, we'll use the Aviation dataset and try to tag sentences that talk about incidents with birds.

a. Let's reuse the {bird} tag we created in previous steps and create a new one called {hit} with the following patterns:

Image Added

b. Create a new tag called {bird-incident}, attach the Fragmented recognizer and add the following pattern:

Image Added

c. Add the Classification recognizer to the {bird-incident} tag and select '--NONE--' in the 'Model' field.

d. Click on the 'Train' button. When the dialog opens:

  - select the Aviation dataset

  - select 'N-Gram' in the 'Feature Selection' field

  - increase the 'max n-gram' field to 3

  - click on the 'Execute' button

This trains a model on the Aviation dataset, using the pattern in the Fragmented recognizer.

Image Added


e. Check the 'Background Processes' tab to see when the Classification training is done. Once it is done, go back to the {bird-incident} tag and disable the Fragmented recognizer.

f. In the Classification recognizer, select your most recently created model in the 'Model' field. It should be named bird-incident-[datetime stamp here].bin, for example: 'bird-incident-20190208173305.bin'. You can also use the option '--LATEST--' to always use the latest generated model.

g. Start a test run using the Aviation dataset.

h. Check the 'Background Processes' tab to see when the test run completes, then open the Search screen to check the results:

Image Added
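The model naming in step f can be illustrated with a short Python sketch. It assumes the datetime stamp follows a year-month-day-hour-minute-second layout, which is inferred only from the example name 'bird-incident-20190208173305.bin'. The helper names `model_timestamp` and `latest_model` are hypothetical, and this is not how the '--LATEST--' option is implemented internally; it only shows how the stamp lets you tell which model is newest.

```python
from datetime import datetime

# Assumed naming scheme, inferred from the example in step f:
# <tag name>-<yyyymmddHHMMSS>.bin
PREFIX = "bird-incident-"
SUFFIX = ".bin"


def model_timestamp(filename, prefix=PREFIX, suffix=SUFFIX):
    """Return the datetime encoded in a model file name, or None if it does not match."""
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        return None
    stamp = filename[len(prefix):-len(suffix)]
    try:
        return datetime.strptime(stamp, "%Y%m%d%H%M%S")
    except ValueError:
        return None


def latest_model(filenames):
    """Return the file name with the most recent stamp, or None if none match."""
    dated = [(model_timestamp(name), name) for name in filenames]
    dated = [(ts, name) for ts, name in dated if ts is not None]
    return max(dated)[1] if dated else None


if __name__ == "__main__":
    # Hypothetical list of model files; only the first name comes from the text.
    models = [
        "bird-incident-20190208173305.bin",
        "bird-incident-20190209101500.bin",
        "notes.txt",
    ]
    print(latest_model(models))
```

Selecting a specific file pins the recognizer to one trained model, while an option like '--LATEST--' trades that reproducibility for convenience when you retrain often.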

As you can see, the Classification recognizer is tagging some sentences that, in theory, are related to incidents with birds. In this case, though, because the dataset is small and the Fragmented recognizer identified only a limited number of positive samples, the Classification recognizer does not do a very good job of identifying the sentences.

With much more and better training data, the accuracy of the Classification recognizer is expected to improve. You can also experiment with the different settings at training time and decide which ones bring better results for your specific use case.
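To see why the number of positive samples matters, here is a minimal Python sketch of the idea behind this kind of training: a rule-based recognizer labels sentences, and those labels become the training samples for the classifier. Everything in it is an assumption made for illustration. The word lists are hypothetical stand-ins for the {bird} and {hit} patterns, the sentences are made up and do not come from the Aviation dataset, and the rule is far simpler than the Fragmented recognizer.

```python
import re

# Hypothetical stand-ins for the {bird} and {hit} tag patterns.
BIRD_TERMS = {"bird", "birds", "goose", "geese"}
HIT_TERMS = {"hit", "struck", "strike", "ingested"}


def tokens(sentence):
    """Lowercase the sentence and split it into alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


def rule_label(sentence):
    """Positive only if the sentence contains both a bird term and a hit term."""
    words = set(tokens(sentence))
    return bool(words & BIRD_TERMS) and bool(words & HIT_TERMS)


# Made-up sentences for illustration only.
sentences = [
    "The aircraft struck a bird on final approach.",
    "Engine ingested geese shortly after takeoff.",
    "The pilot reported a hydraulic failure.",
    "Birds were seen near the runway.",
    "The crew diverted due to weather.",
]

# Each (sentence, label) pair is one training sample for the classifier.
labeled = [(s, rule_label(s)) for s in sentences]
positives = sum(1 for _, is_positive in labeled if is_positive)

if __name__ == "__main__":
    print(f"{positives} positive of {len(labeled)} samples")
```

When the rule matches only a handful of sentences, the classifier has very few positive examples to generalize from, which is consistent with the weak results described above.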
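One of those settings is the 'max n-gram' value from step d. The following Python sketch shows what word n-gram features up to a maximum length generally look like, assuming 'max n-gram' means the largest number of consecutive words combined into one feature. It is not the feature generator used by Apache OpenNLP or by the recognizer; the function name `ngram_features` is hypothetical.

```python
def ngram_features(sentence, max_n=3):
    """Return all word n-grams of length 1..max_n, shortest first."""
    words = sentence.lower().split()
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the sentence.
        for start in range(len(words) - n + 1):
            features.append(" ".join(words[start:start + n]))
    return features


if __name__ == "__main__":
    for feature in ngram_features("bird hit engine", max_n=3):
        print(feature)
```

A larger maximum lets the model see short phrases instead of isolated words, at the cost of many more features, so it tends to pay off only when there is enough training data to support them.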