Introduction
The co-occurrence, or collocation, of words into short phrases (2-4 words) is useful for tagging content and for query enhancement: attaching a level of meaning to these phrases improves the relevancy of result sets.
The components described in this section use Wikipedia as a source of phrases, and DBpedia and Wikilinks to add semantic meaning to those phrases. The basic architecture is as follows:
Execution Order
After the Aspire HDFS feed:
- Token Processing
- Token Statistics
  - Token Statistics Component (with content field)
- Statistical Phrases
  - Token Merge (with content field)
  - Document Merge
  - Statistical Phrases Component
  - Sort Phrases By Weight
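As a rough illustration of what the statistical phrases stage produces, the sketch below counts every 2-4 word sequence across a few documents and sorts the survivors by weight. It is not the Aspire implementation: the function names (`tokenize`, `phrase_counts`, `sort_phrases_by_weight`), the `min_count` threshold, and the use of raw frequency as the weight are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; a stand-in for the real token processing stage.
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_counts(documents, min_len=2, max_len=4):
    # Count every contiguous run of min_len..max_len tokens across all documents.
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def sort_phrases_by_weight(counts, min_count=2):
    # Assumption: the weight is the raw count. Rare phrases are dropped,
    # and ties are broken alphabetically so the order is deterministic.
    kept = [(phrase, c) for phrase, c in counts.items() if c >= min_count]
    return sorted(kept, key=lambda pc: (-pc[1], pc[0]))


docs = [
    "New York City is large",
    "I visited New York City",
    "New York has many parks",
]
ranked = sort_phrases_by_weight(phrase_counts(docs))
for phrase, weight in ranked:
    print(weight, phrase)
```

In a real run the ranked phrases would form the Statistical Phrases dictionary that is later exported to the Master Dictionary.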
Generate the Master Dictionary
- Use Export HDFS To Redis to add the Statistical Phrases dictionary to the Redis Master Dictionary.
- Add any external dictionaries to the Master Dictionary.
- Run Redis Bitmap Calculator to prepare the Master Dictionary for Phrase Extraction.
Once the Master Dictionary is complete:
- Phrase Extraction
- Token Statistics
  - Token Statistics Component (with tagged_phrases and non_tagged_tokens fields)
- Semantic Co-occurrence
  - Token Merge (with tagged_phrases and non_tagged_tokens fields)
  - Document Merge (using the previous token merge output)
  - Co-occurrence Extractor
  - Co-occurrence Merge
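The Phrase Extraction step splits each document into tagged_phrases and non_tagged_tokens. One way to picture this is a greedy longest-match lookup against the Master Dictionary, sketched below. This is only an illustration: the `extract_phrases` function is hypothetical, a plain Python set stands in for the Redis Master Dictionary, and the actual matching strategy used by the component may differ.

```python
def extract_phrases(tokens, dictionary, max_len=4):
    # Greedy longest match: at each position try the longest candidate
    # (up to max_len tokens) first, then progressively shorter ones.
    tagged_phrases = []
    non_tagged_tokens = []
    i = 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                match = candidate
                break
        if match is not None:
            tagged_phrases.append(match)
            i += n  # skip past every token consumed by the phrase
        else:
            non_tagged_tokens.append(tokens[i])
            i += 1
    return tagged_phrases, non_tagged_tokens


# Hypothetical dictionary entries, standing in for the Master Dictionary.
master_dictionary = {"new york", "new york city", "last summer"}
tokens = "i visited new york city last summer".split()
tagged, untagged = extract_phrases(tokens, master_dictionary)
print(tagged)
print(untagged)
```

Preferring the longest match keeps "new york city" as one phrase instead of tagging "new york" and leaving "city" behind as a loose token.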
Other components
- Phrase Extract (from Wikipedia, Wikilinks and DBpedia)
- Copy To HDFS
- Delete From HDFS
- Export HDFS To Redis