Lemmatize tokens matched to words in a dictionary.
Operates On: Lexical Items with TOKEN
Generic Configuration Parameters
- boundaryFlags (type=string array | optional) - List of vertex flags that indicate the beginning and end of a text block. Tokens to process must be between two vertices marked with this flag (e.g. ["TEXT_BLOCK_SPLIT"]).
- skipFlags (type=string array | optional) - Flags to be skipped by this stage. Tokens marked with a skip flag are ignored by this stage, and no processing is performed on them.
- requiredFlags (type=string array | optional) - Lex item flags required on every token to be processed. Tokens must have all of the specified flags in order to be processed.
- atLeastOneFlag (type=string array | optional) - Lex item flags of which every token to be processed needs at least one. Tokens must have at least one of the flags specified in this array.
- confidenceAdjustment (type=double | default=1 | required) - Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every pattern match).
  - 0.0 to < 1.0 decreases the confidence value
  - 1.0 leaves the confidence value unchanged
  - > 1.0 to 2.0 increases the confidence value
- debug (type=boolean | default=false | optional) - Enable all debug log functionality for the stage, if any.
- enable (type=boolean | default=true | optional) - Indicates whether the current stage should be considered by the Pipeline Manager. Only applies to automatic pipeline building.
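The flag and confidence parameters above can be illustrated with a small sketch. This is not the stage's actual implementation: the helper names `should_process` and `adjust_confidence` are hypothetical, the skip rule is assumed to trigger on any one of the skip flags, and the adjustment is assumed to be a plain multiplication because the parameter is described as a factor. Boundary flags are left out, since they depend on the graph structure.

```python
from typing import Iterable, Set


def should_process(
    token_flags: Set[str],
    skip_flags: Iterable[str] = (),
    required_flags: Iterable[str] = (),
    at_least_one_flag: Iterable[str] = (),
) -> bool:
    """Hypothetical helper: apply the three flag filters to one token."""
    skip = set(skip_flags)
    required = set(required_flags)
    any_of = set(at_least_one_flag)

    # skipFlags: assumed to reject a token carrying any skip flag.
    if token_flags & skip:
        return False
    # requiredFlags: the token must carry every listed flag.
    if not required <= token_flags:
        return False
    # atLeastOneFlag: if configured, the token must carry at least one.
    if any_of and not (token_flags & any_of):
        return False
    return True


def adjust_confidence(confidence: float, adjustment: float = 1.0) -> float:
    """Hypothetical helper: scale a confidence value by the adjustment factor.

    Assumption: the factor is multiplied into the confidence value.
    """
    if not 0.0 <= adjustment <= 2.0:
        raise ValueError("confidenceAdjustment must be between 0.0 and 2.0")
    return confidence * adjustment
```

With these assumptions, a factor of 1.0 leaves a match untouched, while 0.5 halves its confidence and 2.0 doubles it.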
Configuration Parameters
- dictionary (string, optional) - The resource containing the list of words and relationships. If no dictionary is provided, a default dictionary is used.
- include (list, optional) - A list of the relationships to include.
- exclude (list, optional) - A list of the relationships to exclude.
- languageISO3 (string, optional) - The language the Lemmatize stage should use. The value must be one of the ISO three-letter language codes. English is used by default unless configured otherwise. At the moment only English (ENG) and Spanish (SPA) are available.
{
"type": "Lemmatize",
"include" : ["pl", "vf"],
"exclude" : ["ob"],
"dictionary" : "lemmatize-provider:lemmatize_words",
"languageISO3":"SPA"
}
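The page does not spell out how include and exclude interact, so the following is only a sketch under stated assumptions: an entry is kept when it shares at least one relationship with include (if include is set) and shares none with exclude (if exclude is set). The function name `keep_entry` is hypothetical.

```python
from typing import Iterable, Optional


def keep_entry(
    rel: Iterable[str],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> bool:
    """Hypothetical filter over a dictionary entry's relationship list."""
    rel_set = set(rel)
    # Assumption: any excluded relationship rejects the entry.
    if exclude and rel_set & set(exclude):
        return False
    # Assumption: when include is set, at least one relationship must match.
    if include and not (rel_set & set(include)):
        return False
    return True
```

Under these assumptions, the configuration above would keep an entry tagged ["vf", "wnm"] and drop one tagged ["wnm", "sp"] or ["ob", "pl"].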
Example Output
V--------------------[I am liking this projects very much]--------------------V
^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^
^--[be]--^---[like]---^ ^--[project]---^
am - {"confidence":0.0084,"rel":["vf","wnm"],"to":"be"}
liking - {"confidence":0.0084,"rel":["vf","wnm"],"to":"like"}
projects - {"confidence":0.012,"rel":["vf","wnm","pl"],"to":"project"}
Output Flags
Lex-Item Flags:
- LEMMATIZE - All words retrieved will be marked as LEMMATIZE
Resource Data
The resource data is a JSON file with an array of words in a field named words. This format applies when the 'dictionary' parameter is used.
{
"words": [
{
"confidence": 0.0049,
"rel": [
"wnm",
"sp"
],
"from": "encyclopaedia",
"to": "encyclopedia"
},
{
"confidence": 0.0752,
"rel": [
"wnm",
"sp"
],
"from": "word",
"to": "worth"
}
]
}
When the 'dictionary' parameter is not used, an embedded Wiktionary file is used instead. This file is formatted as one JSON entry per line:
{"confidence":0,"rel":["syn"],"from":"japonés","to":"nipón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"teutón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"gringo"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"turpial de sureste"}
{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}
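The two resource layouts shown above can be read with the Python standard library. This is a minimal sketch for inspecting or preparing dictionary data, not the loader the stage itself uses. The function names are hypothetical; only the field names (words, from, to, rel, confidence) come from the samples.

```python
import json
from collections import defaultdict
from typing import Dict, List


def load_words_document(text: str) -> List[dict]:
    """Parse the single-document layout: {"words": [ ...entries... ]}."""
    return json.loads(text)["words"]


def load_json_lines(text: str) -> List[dict]:
    """Parse the one-entry-per-line layout, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def build_lookup(entries: List[dict]) -> Dict[str, List[dict]]:
    """Group entries by their "from" word; a word may map to several entries."""
    lookup = defaultdict(list)
    for entry in entries:
        lookup[entry["from"]].append(entry)
    return dict(lookup)
```

Grouping by "from" keeps every target for words that appear more than once, such as "alemán" in the sample above.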
Relationships
The required fields for each entry are:
- from - Original word to search for
- this field will be eliminated once added to the entities of the LexItem
- to - Resulting word
- it will be a new LexItem on its own
- rel - List of relationships between the original word and the resulting word
- List of relationships in the default dictionary:
- pl - pluralization
- vf - verb form
- ob - obsolete
- syn - synonym
- alt - alternative
- wwm - word with meaning (more than one)
- wnm - word no meaning (no additional meaning)
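The required fields, and the note that "from" is dropped once the entry is attached to the LexItem, can be sketched as follows. The helper `to_entity` is hypothetical and only mirrors the shape shown in the Example Output section; it is not the stage's own code.

```python
REQUIRED_FIELDS = ("from", "to", "rel")


def to_entity(entry: dict) -> dict:
    """Hypothetical helper: validate an entry and drop its "from" field."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError("entry is missing required fields: " + ", ".join(missing))
    # Copy every field except "from", leaving the input entry unchanged.
    return {key: value for key, value in entry.items() if key != "from"}
```

For instance, an entry with "from": "am" and "to": "be" would yield metadata containing only the confidence, rel, and to fields.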