Introduction
"Co-occurrence or cooccurrence is a linguistics term that can either mean concurrence / coincidence or, in a more specific sense, the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. In contrast to collocation, co-occurrence assumes interdependency of the two terms. A co-occurrence restriction is identified when linguistic elements never occur together." (Source: Wikipedia)
The co-occurrence or collocation of words to form short phrases (2-4 words) can be useful for tagging content and for query enhancement: attaching a level of meaning to these phrases improves the relevancy of result sets.
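To make "above-chance frequent occurrence" concrete, the following is a minimal sketch that scores adjacent word pairs with pointwise mutual information (PMI). PMI is a common measure swapped in here for illustration; it is not necessarily the statistic the Aspire components use. The function name `pmi_scores`, the `min_count` threshold and the sample corpus are all assumptions.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    A positive score means the pair occurs together more often than
    its individual word frequencies would predict by chance.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # ignore pairs too rare to score reliably
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Hypothetical sample corpus, for illustration only.
corpus = [
    "new york is a large city",
    "she moved to new york last year",
    "a large dog and a new car",
    "the city bought a new bus",
]
scores = pmi_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(pair), round(score, 2))
```

In this toy corpus "new york" scores higher than "a new", because "york" never appears without "new", whereas "a" and "new" each appear in several other contexts.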
The components described in this section use Wikipedia as a source of phrases, and DBpedia and Wikilinks to add semantic meaning to those phrases. The basic architecture used is as follows:
Execution order
After the Aspire HDFS feed:
- Token Processing
- Token Statistics
- Token Statistics Component (with content field)
- Statistical Phrases
- Token Merge (with content field)
- Document Merge
- Statistical Phrases Component
- Sort Phrases By Weight
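As a rough in-memory illustration of what this first stage produces (not the Aspire components themselves, which run over HDFS), the sketch below counts 2-4 word phrases per document, merges the per-document counts, and sorts the surviving phrases by weight. The function names, the whitespace tokenizer, the use of the raw corpus count as the "weight", and the `min_count` cut-off are all assumptions made for the example.

```python
from collections import Counter


def ngram_counts(text, min_len=2, max_len=4):
    """Count every 2-4 word phrase in one document's content field."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def merge_counts(per_document_counts):
    """Merge per-document phrase counts into corpus-wide counts."""
    merged = Counter()
    for counts in per_document_counts:
        merged.update(counts)
    return merged


def sort_phrases_by_weight(merged, min_count=2):
    """Keep phrases seen at least min_count times, heaviest first.

    The weight here is simply the corpus count (an assumption);
    ties are broken alphabetically so the order is deterministic.
    """
    kept = {phrase: count for phrase, count in merged.items() if count >= min_count}
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical documents, for illustration only.
documents = [
    "apache hadoop cluster",
    "the apache hadoop project",
    "hadoop cluster setup",
]
merged = merge_counts(ngram_counts(doc) for doc in documents)
ranked = sort_phrases_by_weight(merged)
print(ranked)
```

Only phrases that recur across the sample documents survive the cut-off; one-off phrases such as "cluster setup" are dropped before sorting.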
Generate the Master Dictionary
- Use Export HDFS to Redis to add the Statistical Phrases dictionary to the Redis Master Dictionary
- Add any external dictionaries to the Master Dictionary
- Run Redis Bitmap Calculator to prepare the Master Dictionary for Phrase Extraction
Once the Master Dictionary is complete:
- Phrase Extraction
- Token Statistics
- Token Statistics Component (with tagged_phrases and non_tagged_tokens fields)
- Semantic Co-occurrence
- Token Merge (with tagged_phrases and non_tagged_tokens fields)
- Document Merge (using the previous token merge output)
- Co-occurrence Extractor
- Co-occurrence Merge
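The second stage can be pictured with a small sketch, assuming the Master Dictionary is modelled as a plain Python set rather than Redis. It splits each document into tagged_phrases and non_tagged_tokens using a greedy longest-match lookup, then counts which tagged phrases appear together in the same document and merges those counts across documents. The matching strategy, the document-level co-occurrence window and all function names are assumptions, not a description of how the Aspire components work internally.

```python
from collections import Counter
from itertools import combinations


def extract_phrases(text, dictionary, max_len=4):
    """Split text into dictionary phrases and leftover single tokens.

    Uses greedy longest-match: at each position the longest phrase
    (max_len down to 2 words) found in the dictionary wins.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    tagged_phrases, non_tagged_tokens = [], []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                tagged_phrases.append(candidate)
                i += n
                break
        else:
            # No phrase starts at this position: keep the token as-is.
            non_tagged_tokens.append(tokens[i])
            i += 1
    return tagged_phrases, non_tagged_tokens


def cooccurrence_counts(documents, dictionary):
    """Count, per phrase pair, the documents in which both phrases occur."""
    merged = Counter()
    for text in documents:
        tagged, _ = extract_phrases(text, dictionary)
        # Sorting makes (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(tagged)), 2):
            merged[pair] += 1
    return merged


# Hypothetical dictionary and documents, for illustration only.
master_dictionary = {"apache hadoop", "search engine", "new york"}
documents = [
    "the apache hadoop search engine",
    "apache hadoop powers a search engine in new york",
]
counts = cooccurrence_counts(documents, master_dictionary)
print(counts.most_common())
```

Here "apache hadoop" and "search engine" co-occur in both sample documents, so that pair ends up with the highest merged count.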