Introduction

Co-occurrence or cooccurrence is a linguistics term that can either mean concurrence / coincidence or, in a more specific sense, the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. In contrast to collocation, co-occurrence assumes interdependency of the two terms. A co-occurrence restriction is identified when linguistic elements never occur together. (Source: Wikipedia)
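To make "above-chance frequent occurrence" concrete, the following is a minimal sketch that scores ordered adjacent word pairs with pointwise mutual information (PMI). PMI is a common general-purpose measure and is used here only as an illustration; it is not necessarily the statistic the components in this section compute. The function name `pmi_scores`, the `min_pair_count` parameter, and the sample corpus are all hypothetical.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_pair_count=2):
    """Score ordered adjacent word pairs by pointwise mutual information.

    A positive score means the pair occurs together more often than
    would be expected if the two words were independent.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        word_counts.update(tokens)
        # Adjacent pairs, keeping the order in which the words appear.
        pair_counts.update(zip(tokens, tokens[1:]))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (first, second), count in pair_counts.items():
        if count < min_pair_count:
            continue  # Ignore pairs that are too rare to be meaningful.
        p_pair = count / total_pairs
        p_first = word_counts[first] / total_words
        p_second = word_counts[second] / total_words
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Hypothetical sample corpus for illustration only.
corpus = [
    "new york is a large city",
    "she moved to new york last year",
    "a large dog ran to the city",
]
scores = pmi_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Only pairs seen at least `min_pair_count` times are scored, since PMI tends to overrate very rare pairs.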

The co-occurrence or collocation of words to form short phrases (2-4 words) can be useful for tagging content and for query enhancement: attaching a level of meaning to these phrases improves the relevancy of result sets.

The components described in this section take advantage of Wikipedia as a source for phrases and DBpedia and Wikilinks to add semantic meaning to those phrases. The basic architecture used is as follows:

Execution order

After the Aspire HDFS feed:

Token Processing
- Token Processing Component
Token Statistics
- Token Statistics Component (with content field)
Statistical Phrases
1. Token Merge (with content field)
2. Document Merge
3. Statistical Phrases Component
4. Sort Phrases By Weight

Generate the Master Dictionary

Use Export HDFS to Redis to add the Statistical Phrases dictionary to the Redis Master Dictionary.
Add any external dictionaries to the Master Dictionary.
Run the Redis Bitmap Calculator to prepare the Master Dictionary for Phrase Extraction.
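
As a rough illustration of how the Master Dictionary combines the statistical phrases with external dictionaries, the sketch below merges them in plain Python. It deliberately does not use Redis or any Aspire API; the function name `build_master_dictionary` and the entry fields `weight` and `sources` are hypothetical, and the actual storage layout used by the components may differ.

```python
def build_master_dictionary(statistical_phrases, external_dictionaries):
    """Merge weighted statistical phrases with external phrase lists.

    statistical_phrases: dict mapping phrase -> weight (assumed shape).
    external_dictionaries: dict mapping source name -> iterable of phrases.
    """
    master = {}
    for phrase, weight in statistical_phrases.items():
        master[phrase.lower()] = {"weight": weight, "sources": {"statistical"}}

    for source_name, phrases in external_dictionaries.items():
        for phrase in phrases:
            # Phrases seen only in external sources carry no statistical weight.
            entry = master.setdefault(
                phrase.lower(), {"weight": None, "sources": set()}
            )
            entry["sources"].add(source_name)
    return master


# Hypothetical inputs for illustration only.
statistical = {"New York": 0.9, "machine learning": 0.7}
external = {"dbpedia": ["New York", "Eiffel Tower"]}
master = build_master_dictionary(statistical, external)
```

Keys are lower-cased so that the same phrase from different sources collapses into a single entry.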

Once the Master Dictionary is complete:

Phrase Extraction
- Phrase Extraction Component
Token Statistics
- Token Statistics Component (with tagged_phrases and non_tagged_tokens fields)
Semantic Co-occurrence
1. Token Merge (with tagged_phrases and non_tagged_tokens fields)
2. Document Merge (using the previous token merge output)
3. Co-occurrence Extractor
4. Co-occurrence Merge
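
The second half of the pipeline can be pictured with the sketch below: a greedy longest-match pass splits each document into tagged phrases and non-tagged tokens, and dictionary phrases that appear in the same document are then counted as co-occurring. This is an assumption about the general technique, not the implementation of the Phrase Extraction or Co-occurrence Extractor components; the function names, the per-document counting window, and the sample data are hypothetical. The output names `tagged_phrases` and `non_tagged_tokens` mirror the field names listed above.

```python
from collections import Counter
from itertools import combinations


def extract_phrases(tokens, dictionary, max_len=4):
    """Greedy longest-match tagging against a phrase dictionary."""
    tagged_phrases, non_tagged_tokens = [], []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink down to one token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                tagged_phrases.append(candidate)
                i += n
                break
        else:
            # No dictionary entry starts at this position.
            non_tagged_tokens.append(tokens[i])
            i += 1
    return tagged_phrases, non_tagged_tokens


def count_cooccurrences(documents, dictionary):
    """Count how many documents contain each pair of tagged phrases."""
    pair_counts = Counter()
    for text in documents:
        phrases, _ = extract_phrases(text.lower().split(), dictionary)
        for pair in combinations(sorted(set(phrases)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical dictionary and documents for illustration only.
dictionary = {"new york", "machine learning"}
documents = [
    "she studies machine learning in new york",
    "new york hosts a machine learning meetup",
]
tagged, untagged = extract_phrases(documents[0].split(), dictionary)
pairs = count_cooccurrences(documents, dictionary)
```

The default `max_len=4` reflects the 2-4 word phrase length mentioned in the introduction.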
