The Aspire Lucene component provides the Lucene classes to other bundles, along with methods for some commonly used Lucene functionality.

This component exists as a holder for the Lucene libraries and exports the Lucene classes for use in other components.

It also provides convenient methods for indexing and searching in an index controlled by the component, although configuration of this index is optional. The services are disabled if the index is not configured.

Lucene Services
Factory Name | com.searchtechnologies.aspire:aspire-lucene
subType | default
Inputs | Method calls
Outputs | Lucene index (optional)

Configuration

Element | Type | Default | Description
indexDirectory | string | <none> | The directory on disk of a Lucene index. The index will be created if it does not exist. If this parameter is not given, indexing and searching methods will not be available.
documentID | string | <none> | The Lucene field to be used as the document id for deletes and updates. If not specified, documents may be added to the index, but updates and deletes will not be available.
luceneMaxFieldLength | int | 10000 | The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. This setting refers to the number of running terms, not to the number of different terms. Note: this silently truncates large documents, excluding from the index all terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a field.
luceneMaxBufferedDocuments | string | -1 (= disabled) | Determines the minimal number of documents required before the buffered in-memory documents are flushed as a new segment. Large values generally give faster indexing. When this is set, the writer will flush every luceneMaxBufferedDocuments added documents. Pass in -1 to prevent triggering a flush due to the number of buffered documents. Note that if flushing by RAM usage is also enabled, then the flush will be triggered by whichever comes first. Disabled by default (writer flushes by RAM usage).
luceneMergeFactor | int | 2 | Sets the index writer merge factor.
luceneRAMBufferSizeMB | int | 2048 | Sets the index writer RAM buffer size in MB.
autoCommitMS | long | 0 (= disabled) | The time (in ms) between commits of the index. If set to 0, auto-commit based on time is disabled. The index is only committed if documents have been added since the last commit.
autoCommitDocs | long | 0 (= disabled) | The maximum number of documents that can be added between commits of the index. If set to 0, auto-commit based on document submission is disabled.
autoCommitSpinWait | long | 1000 ms (= 1 s) | The spin wait time for the thread performing auto-commits (if enabled). The thread wakes this often to check whether the time and document thresholds have been passed and commits if required.
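
The difference between running terms and different terms for luceneMaxFieldLength can be illustrated with a minimal sketch. The FieldLengthLimit class and its indexedTerms method are hypothetical names made up for this illustration and are not part of the Aspire or Lucene API. Splitting on whitespace is a simplifying assumption; real tokenization depends on the Lucene analyzer in use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the Aspire or Lucene API. */
final class FieldLengthLimit {

    private FieldLengthLimit() {
    }

    /**
     * Returns the terms that would be kept for one field when at most
     * maxFieldLength running terms are indexed. Terms are counted in order of
     * occurrence, so repeated terms use up the limit as well.
     * Assumption: terms are separated by whitespace.
     */
    static List<String> indexedTerms(String fieldText, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String term : fieldText.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input produces a single empty token
            }
            if (kept.size() >= maxFieldLength) {
                break; // everything after the limit is silently dropped
            }
            kept.add(term);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Five running terms but only three different ones. A limit of 3 keeps
        // the first three occurrences, so "c" never reaches the index.
        System.out.println(indexedTerms("a b a b c", 3));
    }
}
```

With a limit of 3, the text "a b a b c" keeps only a, b and a. The term c is lost even though the text contains just three different terms, which is why the limit should be sized against the total length of the largest expected field.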

Example Configuration

Simple

    <component name="LuceneService" subType="default" factoryName="aspire-lucene"/>

Complex

    <component name="LuceneIndexer" subType="default" factoryName="aspire-lucene">
      <indexDirectory>data/index/lucene-index</indexDirectory>
      <documentID>url</documentID>
      <luceneMaxFieldLength>10000</luceneMaxFieldLength>
      <luceneMaxBufferedDocuments>100</luceneMaxBufferedDocuments>
      <autoCommitSpinWait>5000</autoCommitSpinWait>
      <autoCommitMS>1800000</autoCommitMS>
      <autoCommitDocs>10000</autoCommitDocs>
    </component>
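
With the values in the complex example, the auto-commit thread would wake every 5 seconds and commit once 30 minutes have passed or 10,000 documents have been added, provided something was added since the last commit. The following is a minimal sketch of that decision, not the component's actual implementation. The AutoCommitPolicy class and its methods are hypothetical names, and measuring the time threshold from the previous commit is an assumption.

```java
/** Hypothetical illustration only; not the component's actual implementation. */
final class AutoCommitPolicy {
    private final long autoCommitMs;   // 0 disables time-based commits
    private final long autoCommitDocs; // 0 disables document-count commits
    private long lastCommitTime;       // assumption: time is measured from the last commit
    private long docsSinceCommit;

    AutoCommitPolicy(long autoCommitMs, long autoCommitDocs, long startTime) {
        this.autoCommitMs = autoCommitMs;
        this.autoCommitDocs = autoCommitDocs;
        this.lastCommitTime = startTime;
    }

    /** Records that one document was added to the index. */
    synchronized void documentAdded() {
        docsSinceCommit++;
    }

    /** Checked by the commit thread each time it wakes from its spin wait. */
    synchronized boolean shouldCommit(long now) {
        if (docsSinceCommit == 0) {
            return false; // nothing added since the last commit
        }
        boolean timeDue = autoCommitMs > 0 && now - lastCommitTime >= autoCommitMs;
        boolean docsDue = autoCommitDocs > 0 && docsSinceCommit >= autoCommitDocs;
        return timeDue || docsDue;
    }

    /** Records that a commit happened at the given time. */
    synchronized void committed(long now) {
        lastCommitTime = now;
        docsSinceCommit = 0;
    }

    public static void main(String[] args) {
        // Values from the complex example: 30 minutes or 10000 documents.
        AutoCommitPolicy policy = new AutoCommitPolicy(1_800_000L, 10_000L, 0L);
        policy.documentAdded();
        System.out.println(policy.shouldCommit(5_000L));     // false
        System.out.println(policy.shouldCommit(1_800_000L)); // true
    }
}
```

In this sketch, a thread sleeping for autoCommitSpinWait milliseconds between checks would call shouldCommit on each wake-up. A shorter spin wait makes commits land closer to the configured thresholds at the cost of more frequent checks.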

Accessing from External Components

To use the indexing and searching capabilities of this component, you must configure the <indexDirectory> parameter. The services are then available through the AspireLucene interface (AspireLucene.java).

Components wishing to access this functionality should maintain a service tracker for this component, get an instance, and then call the appropriate method. See here for further details.
