Paragraph Filter
Description: Removes particular paragraphs from the token stream
Class: com.searchtechnologies.aspire.components.ParagraphFilter
Inputs: Stream of tokens
Outputs: Stream of tokens, possibly missing certain token blocks
Variables Written: none
@scope: no


Configuration

Element | Type | Default | Description
hashfile | String | - | REQUIRED: Name of the file holding paragraph signatures and counts (tab-separated). If this file (or this option) is not present, the filter does no processing.
commonCount | int | 0 | Minimum count a paragraph must have in order to be removed. Entries in the hashfile with counts lower than this number are ignored.
action | String | "remove" | The action to take on recognized paragraphs. Possible actions: remove
paragraphSize | int | 250 | Maximum size of a paragraph in tokens. If larger paragraphs come down the token stream, either they will not be removed at all, or only the last paragraphSize tokens of them will be removed.
fields | String | - | Whitespace-separated list of field names to inspect. Fields not listed here are unaffected (no paragraphs in those fields will be removed). The default (option missing or empty) is to look for paragraphs in all fields.
debug | boolean | false | Turn on debug output.

Example Configuration

 <processor class="com.searchtechnologies.aspire.components.ParagraphFilter">
   <fields>MainContentField</fields>
   <hashfile>data/common-paragraphs.txt</hashfile>
   <commonCount>50</commonCount>
 </processor>
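
A fuller sketch that spells out every option from the configuration table. It assumes the remaining options are written as child elements in the same way as fields, hashfile, and commonCount in the example above. The values for action, paragraphSize, and debug are simply the documented defaults, and the second field name, SummaryField, is hypothetical and only illustrates the whitespace-separated list.

 <processor class="com.searchtechnologies.aspire.components.ParagraphFilter">
   <!-- Whitespace-separated field names; SummaryField is a hypothetical name -->
   <fields>MainContentField SummaryField</fields>
   <!-- Tab-separated file of paragraph signatures and counts -->
   <hashfile>data/common-paragraphs.txt</hashfile>
   <!-- Hashfile entries with a count lower than 50 are ignored -->
   <commonCount>50</commonCount>
   <!-- The values below are the documented defaults, written out explicitly -->
   <action>remove</action>
   <paragraphSize>250</paragraphSize>
   <debug>false</debug>
 </processor>

Because the last three values match the defaults, this configuration should behave like the shorter example except that paragraphs are also looked for in the second field.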

