This feature is currently in Beta and only available in a SNAPSHOT version

Normally in Aspire we’re trying to take data from a source, do some form of processing and publish the transformed data to a search engine. To allow for high throughput and low latency, we typically want the time from when we discover the data to when it’s available in the search engine to be as low as reasonably possible. Normally we don’t have an issue – the workflow actions are generally quick and the “worst case” is extracting the text from a very large document, which could take a couple of minutes.

But suppose we want to do something more complex that takes longer – performing optical character recognition (OCR), rendering content to produce thumbnails, or calling out to an external web service that throttles our interaction? In the case of OCR or thumbnailing, we could add more hardware and/or threads to improve the throughput, but in the case of a throttled service, that’s not going to help.

By using background processing, we simply admit up front that we can’t do everything we want in the desired time: we perform the quicker tasks in line and defer the longer, more complex tasks until resources allow.

Background processing data flow

Below is an example data flow for background processing.

A typical Aspire workflow has a connector extracting data from a repository, performing text extraction and publishing to a search engine. To enable background processing, we simply add an extra workflow task to “publish” the document to a queue. This is shown on the top row of the diagram.
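The “publish to a queue” step can be sketched as follows. This is an illustrative Python sketch, not Aspire’s actual API: the stage name `publish_to_queue_stage` and the use of a plain in-process queue are assumptions standing in for Aspire’s workflow stages and its distributed queue. The key point is that only a lightweight reference to the document is enqueued, so the in-line workflow stays fast.

```python
import json
import queue

# Toy stand-in for Aspire's distributed queue (in Aspire this would
# be backed by MongoDB, HBase or Elasticsearch, not an in-process queue).
background_queue = queue.Queue()

def publish_to_queue_stage(doc):
    """Hypothetical workflow stage: enqueue a lightweight reference to
    the document so the heavy work can happen later, off the crawl path."""
    background_queue.put(json.dumps({
        "id": doc["id"],
        "url": doc["url"],
    }))
    return doc  # the in-line workflow continues unchanged
```

The in-line pipeline never waits on the slow work; it just records that the document needs further processing and moves on.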

On the bottom row of the diagram, we have the “processing” side. A special connector reads from the queue. The workflow is the same as any other connector – we perform some tasks and then publish to the search engine. Assuming the search engine supports it, we can just send an update – there’s no need to resend the text content extracted on the top row unless it’s changed.
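The processing side can be sketched like this (again an illustrative Python sketch, not Aspire’s API; `ocr` and `background_consumer` are hypothetical names). The important detail is the partial update: only the newly computed fields are sent to the search engine.

```python
import json
import queue

def ocr(doc_ref):
    # Placeholder for the slow work (OCR, thumbnailing, a throttled
    # web-service call, ...) performed in the background.
    return "recognised text for " + doc_ref["id"]

def background_consumer(q, search_engine):
    """Drain the queue and apply *partial* updates to the search engine:
    only the new fields are written; the text extracted on the in-line
    path is never resent."""
    while not q.empty():
        doc_ref = json.loads(q.get())
        search_engine.setdefault(doc_ref["id"], {})["ocr_text"] = ocr(doc_ref)
```

Here `search_engine` is modelled as a plain dictionary; in practice this would be an update call against the engine’s API.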

As part of the 3.0 connector framework, we developed a distributed queuing mechanism based around MongoDB. This was later extended to allow other “providers” to be used, and we currently support MongoDB, HBase and Elasticsearch. This same queuing mechanism is used for background processing.
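The property such a distributed queue needs is an atomic “claim” operation, so that several processing workers never pick up the same item; with MongoDB this maps naturally onto an atomic find-and-modify of a status field. A minimal in-memory sketch of the idea (the class and field names are illustrative, not Aspire’s implementation):

```python
import threading

class SimpleQueue:
    """Toy stand-in for a distributed queue: items move atomically from
    'available' to 'in-progress', so two workers never claim the same one."""
    def __init__(self):
        self._lock = threading.Lock()
        self._items = []  # each item is a dict with a 'status' field

    def put(self, payload):
        with self._lock:
            self._items.append({"payload": payload, "status": "available"})

    def claim(self):
        # Atomic claim: analogous to MongoDB's findAndModify, which finds
        # an 'available' item and flips it to 'in-progress' in one step.
        with self._lock:
            for item in self._items:
                if item["status"] == "available":
                    item["status"] = "in-progress"
                    return item["payload"]
        return None  # nothing left to claim
```

A real provider would also handle timeouts and re-queueing of items whose worker died mid-processing; those details are omitted here.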

Binary data flow

There is, however, a limitation to the flow above: the binary file extracted from the repository. In Aspire we try to be as efficient as we can and use streams wherever possible. The binary stream of data from the repository is consumed by the extract-text stage and is no longer available. If we need that stream for background processing, we’ll need to do something extra. Using Aspire’s Binary Store, we add an extra stage to both parts of the processing. The data flow is now as below.

Before we use the stream to extract the text when processing the original document, we add a stage that uses Aspire’s binary store capability (see here) to write the binary to the store. Now, when we extract the text, we still have a copy of the original binary to process again (as many times as we want). When we perform the “background” processing, we “read” the file so it’s available to any subsequent workflow stages.
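The write/read pair around the binary store can be sketched as below. This is an illustrative Python sketch under assumed names (`store_binary_stage`, `fetch_binary_stage`, and a dictionary standing in for Aspire’s binary store); the shape of the idea is what matters: copy the bytes aside before the extract-text stage consumes the stream, then re-open them on the background side.

```python
import io

binary_store = {}  # toy stand-in for Aspire's binary store

def store_binary_stage(doc, stream):
    """In-line side: copy the repository stream into the store *before*
    text extraction consumes it, then hand a fresh stream onwards."""
    data = stream.read()
    binary_store[doc["id"]] = data
    return io.BytesIO(data)  # fresh stream for the extract-text stage

def fetch_binary_stage(doc):
    """Background side: re-open the stored binary so later stages
    (OCR, thumbnailing, ...) have the original bytes again."""
    return io.BytesIO(binary_store[doc["id"]])
```

Because the store keeps the original bytes, the background workflow can re-read them as many times as it needs without going back to the source repository.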
