
Publishing to the File System Staging Repository

To publish to a staging repository, replace the usual search engine publisher in the workflow with a publisher for the desired staging repository.

Select the File System Staging Repository from the Publishers section of the workflow, or install a custom publisher with the coordinates com.searchtechnologies.aspire:app-file-repo-publisher.

On the configuration screen, set the repository location. This is the directory on disk that serves as the base directory for the repository. All information is stored under this directory, so ensure that it is on a disk with sufficient capacity.

You may choose to compress or encrypt the data. If you turn on encryption, you must choose the encryption algorithm and configure a password.

By default, the publisher publishes only the document metadata to the repository. If the connector crawling the original content source repository produces a stream (for example, to a file or attachment), you may choose to publish this stream to the repository as well. If you do, be aware that the stream can only be published if it has not already been consumed by another stage, such as extract text. In fact, for most Aspire connectors, you will need to disable extract text in the advanced configuration if you wish to save the stream in the repository.

If you plan to use more than one Java virtual machine to access the file staging repository at the same time (for example, if you are using failover or distributed processing), you should turn on file locking to ensure the transaction log is consistent.

Content Source and Owner

When you configure the publisher, you can optionally configure the content source and owner. These determine the exact location of the published item in the repository. If you don’t specify the content source, it is taken from the document being published. If you don’t specify the owner, the value "default" is used.
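
The fallback rules above can be sketched as follows. This is only an illustration of the described behaviour, not the publisher's actual code: the function name resolve_source_and_owner and its parameters are hypothetical, and how the resolved values map to a path inside the repository is not covered here.

```python
from typing import Optional, Tuple

# Assumption: the fallback owner is the literal string "default".
DEFAULT_OWNER = "default"


def resolve_source_and_owner(
    configured_source: Optional[str],
    configured_owner: Optional[str],
    document_source: str,
) -> Tuple[str, str]:
    """Return the (content source, owner) pair used for a published item.

    Hypothetical helper: a configured value wins; otherwise the content
    source comes from the document and the owner falls back to "default".
    """
    source = configured_source or document_source
    owner = configured_owner or DEFAULT_OWNER
    return source, owner
```

For example, with nothing configured and a document whose content source is "example-source", the sketch returns ("example-source", "default").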

Real Time Updates

The publisher can send JMS messages when transactions occur. This lets the publisher be closely coupled to a second crawler: a crawl of the original content source repository publishes to the staging repository, which submits an event to a JMS queue that is read by the other crawler. The crawl and index processes can therefore be separated, as described in the introduction.

If you turn on this functionality, you need to configure the JMS server and queue to connect to. Currently the publisher supports only ActiveMQ. You can use an external broker or install the Aspire JMS Server service.

JMS Message Format

If real time updates are configured, JMS messages are emitted in the following format:

<transactions>
   <transaction id="1234" timestamp="2014/07/13T12:34:56Z" action="[add|update|delete]">
      <item id="????" repositoryLocation="????" contentSource="????" owner="????"/>
   </transaction>
</transactions>
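
A consumer of these messages needs to turn the XML into usable records. The following is a minimal sketch of such parsing, assuming the message body arrives as a text string in the format shown above. The class and function names (Item, Transaction, parse_transactions) are hypothetical, as are the item values in SAMPLE; receiving the message from the queue is not shown.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    id: str
    repository_location: str
    content_source: str
    owner: str


@dataclass
class Transaction:
    id: str
    timestamp: str  # kept as the raw string from the message
    action: str     # one of "add", "update" or "delete"
    items: List[Item]


def parse_transactions(xml_text: str) -> List[Transaction]:
    """Parse a <transactions> message into a list of Transaction records."""
    root = ET.fromstring(xml_text)
    transactions = []
    for tx in root.findall("transaction"):
        items = [
            Item(
                id=item.get("id", ""),
                repository_location=item.get("repositoryLocation", ""),
                content_source=item.get("contentSource", ""),
                owner=item.get("owner", ""),
            )
            for item in tx.findall("item")
        ]
        transactions.append(
            Transaction(
                id=tx.get("id", ""),
                timestamp=tx.get("timestamp", ""),
                action=tx.get("action", ""),
                items=items,
            )
        )
    return transactions


# Illustrative message; the item attribute values are made up.
SAMPLE = """<transactions>
   <transaction id="1234" timestamp="2014/07/13T12:34:56Z" action="add">
      <item id="doc-1" repositoryLocation="/repo/doc-1" contentSource="example-source" owner="default"/>
   </transaction>
</transactions>"""

parsed = parse_transactions(SAMPLE)
```

The timestamp is left unparsed in this sketch because only one example value appears in the format above.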