Paragraph Filter | |
---|---|
Description: | Removes particular paragraphs from the token stream |
Class: | com.searchtechnologies.aspire.components.ParagraphFilter |
Inputs: | Stream of tokens |
Outputs: | Stream of tokens, possibly missing certain token blocks |
Variables Written: | none |
@scope: | no |
Configuration
Element | Type | Default | Description |
---|---|---|---|
hashfile | String | - | REQUIRED: Name of the file holding paragraph signatures and counts (tab-separated). If this file (or this option) is not present, the filter does no processing. |
commonCount | int | 0 | Minimum count a paragraph must have in order to be removed. Entries in the hashfile with counts lower than this number are ignored. |
action | String | "remove" | The action to take on recognized paragraphs. Possible actions: remove |
paragraphSize | int | 250 | Maximum size of a paragraph in tokens. If larger paragraphs come down the token stream, either they will not be removed at all, or only the last paragraphSize tokens of them will be removed. |
fields | String | - | Whitespace-separated list of fieldnames to inspect. Fields not mentioned here will be unaffected (no paragraphs in those fields will be removed). The default (option missing or empty) is to look for paragraphs in all fields. |
debug | boolean | false | Turn on debug output. |
Example Configuration
<processor class="com.searchtechnologies.aspire.components.ParagraphFilter">
  <fields>MainContentField</fields>
  <hashfile>data/common-paragraphs.txt</hashfile>
  <commonCount>50</commonCount>
</processor>
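For reference, a fuller configuration that sets every option from the table above might look like the following sketch. The element names and defaults come from the configuration table; the second field name (TitleField) is hypothetical and only illustrates the whitespace-separated list, and the values shown are illustrative rather than recommended.

```xml
<processor class="com.searchtechnologies.aspire.components.ParagraphFilter">
  <!-- Required: tab-separated file of paragraph signatures and counts -->
  <hashfile>data/common-paragraphs.txt</hashfile>
  <!-- Hashfile entries with counts below 50 are ignored -->
  <commonCount>50</commonCount>
  <!-- "remove" is the default and the only documented action -->
  <action>remove</action>
  <!-- Maximum paragraph size in tokens (250 is the default) -->
  <paragraphSize>250</paragraphSize>
  <!-- Whitespace-separated field names; TitleField is a hypothetical example -->
  <fields>MainContentField TitleField</fields>
  <!-- Turn on debug output -->
  <debug>true</debug>
</processor>
```

Omitting the fields element (or leaving it empty) makes the filter look for paragraphs in all fields.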