Paragraph Filter (Aspire 2)

Paragraph Filter
Description:	Removes particular paragraphs from the token stream
Class:	com.searchtechnologies.aspire.components.ParagraphFilter
Inputs:	Stream of tokens
Outputs:	Stream of tokens, possibly missing certain token blocks
Variables Written:	none
@scope:	no

Configuration

Element	Type	Default	Description
hashfile	String	-	REQUIRED: Path to the tab-separated file holding paragraph signatures and their counts. If this file does not exist, or this option is not set, the filter does no processing.
commonCount	int	0	Minimum count a paragraph must have in order to be removed. Entries in the hashfile with counts lower than this number are ignored.
action	String	"remove"	The action to take on recognized paragraphs. Possible actions: remove
paragraphSize	int	250	Maximum size of a paragraph, in tokens. A larger paragraph coming down the token stream is either not removed at all, or only its last paragraphSize tokens are removed.
fields	String	-	Whitespace-separated list of field names to inspect. Fields not listed here are unaffected (no paragraphs in those fields are removed). If this option is missing or empty, paragraphs are looked for in all fields.
debug	boolean	false	Turn on debug output.

Example Configuration

 <processor class="com.searchtechnologies.aspire.components.ParagraphFilter">
   <fields>MainContentField</fields>
   <hashfile>data/common-paragraphs.txt</hashfile>
   <commonCount>50</commonCount>
 </processor>
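
As a further sketch, the fragment below sets every element from the configuration table explicitly. The values given for action and paragraphSize are simply the documented defaults, and the second field name (SummaryField) is hypothetical, included only to illustrate the whitespace-separated list.

```xml
<processor class="com.searchtechnologies.aspire.components.ParagraphFilter">
  <!-- Required: tab-separated file of paragraph signatures and counts -->
  <hashfile>data/common-paragraphs.txt</hashfile>
  <!-- Hashfile entries with a count lower than 50 are ignored -->
  <commonCount>50</commonCount>
  <!-- Documented default action -->
  <action>remove</action>
  <!-- Documented default: maximum paragraph size in tokens -->
  <paragraphSize>250</paragraphSize>
  <!-- Whitespace-separated field names; SummaryField is a hypothetical example -->
  <fields>MainContentField SummaryField</fields>
  <!-- Turn on debug output -->
  <debug>true</debug>
</processor>
```

Omitting the fields element would, per the table above, make the filter look for paragraphs in all fields.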
