Warning

This feature is currently in Beta and is only available in a SNAPSHOT version.

In Aspire, we try to be as efficient as possible, especially when passing around large binary objects such as files or attachments. We have processes called “Fetchers”, but they typically don’t actually fetch anything. Instead, they open a stream to the object, so the transfer happens only when something such as text extraction reads the data. This means we’re not transferring the data (possibly over a slow network route) until we know we need it. It also means we don’t have to hold the entire object (which could be gigabytes) in memory, since a stream is consumed a small number of bytes at a time.
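As a rough illustration (hypothetical code, not an Aspire component), this is the pattern a stream consumer such as text extraction follows – a fixed-size buffer is reused, so memory use stays constant regardless of the size of the object:

```java
// Minimal, hypothetical sketch: consuming a stream a small buffer at a time,
// so the whole object never has to be held in memory.
import java.io.IOException;
import java.io.InputStream;

public final class StreamConsumerExample {

  /** Reads the stream in fixed-size chunks; here we only count the bytes. */
  static long countBytes(InputStream in) throws IOException {
    byte[] buffer = new byte[8192];   // fixed-size buffer, not the whole file
    long total = 0;
    int read;
    while ((read = in.read(buffer)) != -1) {
      total += read;                  // a real consumer (e.g. text extraction) would process the chunk here
    }
    return total;
  }
}
```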

However, this approach comes with a disadvantage. Whilst it is sometimes possible to “back up” on a stream and process the data more than once, typically once you’ve processed the data it’s gone. If you wanted to process it a second time, you’d have to open another stream and transfer the data again – possibly over that same slow network. A binary store would give us the ability (when we know we need to process the file more than once) to store the file somewhere more local.

In other use cases, we might produce binary objects (document rendering, for example) and need a place to store them before presenting them via a UI. A binary store would also work in this scenario.

However, what is a good medium for the store? A local file system, S3, Azure, something else? Any binary store we have would benefit from being built in such a way as to abstract the Aspire workflow components from the code that actually writes to the medium. That way the workflow can just “read” or “write”, with something else worrying about where the data is written to or read from.

I’ve mentioned reading and writing, but it would also be useful to be able to delete individual objects from the store, and possibly to clear it entirely.

Architecture

The binary store’s architecture is shown below.

[Diagram: binary store architecture]

Each binary store inside Aspire is implemented as an Aspire Service that implements a Java interface. This interface allows for all the common storage actions – read, write, delete, clear store and so on. The service implements the code required to perform those actions for the given store – local file system, S3 or Azure. Components are implemented that call into the services and perform the desired storage actions – write, read, delete.
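As an illustration, a service interface along these lines might look like the sketch below. The names and signatures are hypothetical and are not Aspire’s actual API:

```java
// Hypothetical sketch only: illustrative names and signatures, not Aspire's actual service interface.
import java.io.IOException;
import java.io.InputStream;

/**
 * Abstracts the storage medium (local file system, S3, Azure, ...) so that
 * workflow components only ever "read", "write", "delete" or "clear".
 */
public interface BinaryStore {

  /** Store the content of the stream under the given key. */
  void write(String key, InputStream content) throws IOException;

  /** Open a stream over a previously stored object. */
  InputStream read(String key) throws IOException;

  /** Remove a single object from the store. */
  void delete(String key) throws IOException;

  /** Remove every object held by this store instance. */
  void clear() throws IOException;
}
```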

This approach has a number of advantages:

  • It reduces the total number of Aspire components
    • There is one “read” component, one “write” component and one “delete” component, plus one service per supported store – rather than a “read” component for each supported store, a “write” component for each supported store, and so on
  • It makes configuration easier
    • If a store requires a lot of configuration including authorisation parameters, these need only be entered once, rather than once for every pipeline component that needs to connect to the store
  • It makes migration between store types easier
    • If a pipeline needs to write to a store of a different type, the administrator simply changes the pipeline configuration to reference a different service, rather than replacing the “write to local file” component with a “write to S3” component

This architecture also supports having multiple stores of the same type by installing more than one service with different configurations – allowing (say) the use of more than one file system directory or more than one S3 bucket.
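To make that concrete, below is a sketch of a store implementation for one medium (the local file system). It is purely illustrative – it builds on the hypothetical BinaryStore interface sketched above, and real Aspire services are set up through service configuration rather than constructed like this. Two instances created with different base directories behave as two independent stores, while an S3 or Azure implementation would expose exactly the same interface:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Iterator;
import java.util.stream.Stream;

/** Purely illustrative implementation of the hypothetical BinaryStore interface above. */
final class LocalFileBinaryStore implements BinaryStore {

  private final Path baseDir;   // per-instance configuration: where this store keeps its objects

  LocalFileBinaryStore(Path baseDir) throws IOException {
    this.baseDir = Files.createDirectories(baseDir);
  }

  @Override public void write(String key, InputStream content) throws IOException {
    Files.copy(content, baseDir.resolve(key), StandardCopyOption.REPLACE_EXISTING);
  }

  @Override public InputStream read(String key) throws IOException {
    return Files.newInputStream(baseDir.resolve(key));
  }

  @Override public void delete(String key) throws IOException {
    Files.deleteIfExists(baseDir.resolve(key));
  }

  @Override public void clear() throws IOException {
    try (Stream<Path> files = Files.list(baseDir)) {
      Iterator<Path> it = files.iterator();
      while (it.hasNext()) {
        Files.deleteIfExists(it.next());
      }
    }
  }
}
```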

Normally in Aspire we’re trying to take data from a source, do some form of processing and publish the transformed data to a search engine. To allow for high throughput and low latency, we typically want the time from when we discover the data to when it’s available in the search engine to be as low as is reasonably possible. Normally we don’t have an issue – the workflow actions are generally quick, and the “worst case” is that we’re extracting the text from a very large document, which could take a couple of minutes.

But suppose we want to do something more complex that takes longer? Performing optical character recognition (OCR) or rendering content to produce thumbnails? Or calling out to an external web service that throttles our interaction? In the case of OCR or thumbnailing, we could add more hardware and/or threads and improve the throughput, but in the case of a throttled service, that’s not going to help.

By using background processing, we can simply accept up front that we can’t do everything we want within the desired time: we perform the quicker tasks inline and defer the longer, more complex tasks until resources allow.

Background processing data flow

Below is an example data flow for background processing.

[Diagram: background processing data flow]

A typical Aspire workflow has a connector extracting data from a repository, performing text extraction and publishing to a search engine. To enable background processing, we simply add an extra workflow task to “publish” the document to a queue. This is shown on the top row of the diagram.

On the bottom row of the diagram, we have the “processing” side. A special connector reads from the queue. The workflow is the same as any other connector – we perform some tasks and then publish to the search engine. Assuming the search engine supports it, we can just pass an update – there’s no need to resend the text content we extract on the top row unless it’s been changed.

As part of the 3.0 connector framework, we developed a distributed queuing mechanism based around MongoDB. This was later extended to allow other “providers” to be used, and we currently support MongoDB, HBase and Elasticsearch. This same queuing mechanism is used for background processing.
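As an illustration of the shape of the processing side (the interface and names below are hypothetical, not Aspire’s actual queue API – the real distributed queue is provided and configured by the framework), the background connector is essentially a loop that dequeues document references and performs the slow work before publishing an update:

```java
// Hypothetical sketch of the background-processing consumer loop.
import java.util.Optional;

interface DocumentQueue {
  void enqueue(String docId);            // called by the extra workflow task on the main flow
  Optional<String> dequeue();            // called by the background connector
}

final class BackgroundWorker implements Runnable {

  private final DocumentQueue queue;

  BackgroundWorker(DocumentQueue queue) { this.queue = queue; }

  @Override public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      Optional<String> docId = queue.dequeue();
      if (docId.isEmpty()) { sleepQuietly(); continue; }
      process(docId.get());
    }
  }

  private void process(String docId) {
    // Slow work (OCR, thumbnailing, throttled web-service calls) happens here,
    // then only the new fields are published as an update to the search engine.
  }

  private void sleepQuietly() {
    try { Thread.sleep(1000); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
```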

Binary data flow

There is, however, a limitation to the flow above: the binary file extracted from the repository. In Aspire we try to be as efficient as we can and use streams wherever possible. The binary stream of data from the repository is consumed by the extract text stage and is then no longer available. If we need that stream for background processing, we’ll need to do something extra. Using Aspire’s [link to Binary Store], we add an extra stage to both parts of the processing. The data flow is now as shown below.

[Diagram: background processing data flow with binary store]
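To illustrate the two extra stages (again a hypothetical sketch reusing the BinaryStore interface from earlier, not actual Aspire components): the main flow persists the fetched binary in the store, and the background flow reopens it locally rather than refetching it from the repository:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the two extra workflow stages.
final class BinaryStoreStages {

  private final BinaryStore store;

  BinaryStoreStages(BinaryStore store) { this.store = store; }

  /** Extra stage on the main (connector) flow: persist the fetched binary. */
  void storeOriginal(String docId, InputStream fetched) throws IOException {
    store.write(docId, fetched);
  }

  /** Extra stage on the background flow: reopen the binary locally instead of refetching it. */
  InputStream openForBackgroundProcessing(String docId) throws IOException {
    return store.read(docId);   // e.g. feed this stream to OCR or thumbnailing
  }
}
```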

...
