This feature is currently in Beta and is only available in a SNAPSHOT version.

In Aspire, we try to be as efficient as possible, especially when passing around large binary objects such as files or attachments. We have processes called “Fetchers”, but they typically don’t actually fetch anything. Instead, they open a stream to the object, so the transfer happens only when something such as text extraction reads the data. This means we’re not transferring the data (possibly over some slow network route) until we know we need it. It also means we don’t have to hold the entire object (which could be gigabytes) in memory, because a stream is processed a few bytes at a time.

However, this approach comes with a disadvantage. Whilst it is sometimes possible to “back up” on a stream and process the data more than once, typically once you’ve processed the data it’s gone. If you wanted to process it a second time, you’d have to open another stream and transfer the data again – possibly over that same slow network. A binary store would give us the ability, when we know we’ll need to process a file more than once, to store it somewhere more local.
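As a minimal illustration of the trade-off (using hypothetical helper methods, not anything from Aspire’s API), the sketch below shows a stream being processed one buffer at a time – so the whole object is never held in memory – and a local copy being taken when the data will be needed more than once:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamDemo {

    // A stream is consumed as it is read; a second pass would normally
    // mean opening another stream and transferring the data again.
    static long countBytes(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read; // one buffer at a time, never the whole object
        }
        return total;
    }

    // Copying the stream to a local file once lets us open as many
    // cheap local streams over the same bytes as we like afterwards.
    static Path storeLocally(InputStream in) throws IOException {
        Path local = Files.createTempFile("binary-store-", ".bin");
        Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
        return local;
    }
}
```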

In other use cases, we might produce binary objects (document rendering, for example) and need a place to store them before presenting them via a UI. A binary store would also work in this scenario.

However, what is a good medium for the store? A local file system, S3, Azure, something else? Any binary store we build would benefit from abstracting the Aspire workflow components away from the code that actually writes to the medium. That way the workflow can simply “read” or “write”, and something else worries about where the data is written to or read from.

I’ve mentioned reading and writing, but it would also be useful to be able to delete objects from the store, or to clear it entirely.
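To make that set of operations concrete, here is a sketch of what such a store might look like as a Java interface. The names are hypothetical, chosen purely for illustration – this is not Aspire’s actual API:

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical abstraction over a binary store. Workflow components
 * would depend only on this interface; the implementation decides
 * whether the bytes live on a local file system, in S3, in Azure, etc.
 */
public interface BinaryStore {

    /** Write the given stream to the store under the given key. */
    void write(String key, InputStream data) throws IOException;

    /** Open a stream over a previously stored object. */
    InputStream read(String key) throws IOException;

    /** Remove a single object from the store. */
    void delete(String key) throws IOException;

    /** Remove every object in the store. */
    void clear() throws IOException;
}
```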

Architecture

The binary store’s architecture is shown below

Each binary store inside Aspire is implemented as an Aspire Service that implements a Java interface. This interface covers all the common storage actions – read, write, delete, clear the store and so on. The service implements the code required to perform those actions against the given store – local file system, S3 or Azure. Components are then implemented that call into the services to perform the desired storage actions – write, read, delete.
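To illustrate the service side, a local file system service might implement an interface like the hypothetical BinaryStore sketched earlier along these lines (again a sketch, not the actual implementation):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

/** Hypothetical local file system implementation of the BinaryStore sketch. */
public class FileSystemBinaryStore implements BinaryStore {

    private final Path root;

    public FileSystemBinaryStore(Path root) throws IOException {
        // All objects for this store instance live under one directory.
        this.root = Files.createDirectories(root);
    }

    @Override
    public void write(String key, InputStream data) throws IOException {
        Files.copy(data, root.resolve(key), StandardCopyOption.REPLACE_EXISTING);
    }

    @Override
    public InputStream read(String key) throws IOException {
        return Files.newInputStream(root.resolve(key));
    }

    @Override
    public void delete(String key) throws IOException {
        Files.deleteIfExists(root.resolve(key));
    }

    @Override
    public void clear() throws IOException {
        // Remove every object under the store's root directory.
        try (Stream<Path> files = Files.list(root)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                Files.deleteIfExists(p);
            }
        }
    }
}
```

An S3 or Azure service would implement the same interface, so the calling components never need to change.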

This approach has a number of advantages:

  • It reduces the total number of Aspire components
    • There is one “read” component, one “write” component and one “delete” component, plus one service per supported store – rather than a separate “read” component, “write” component and so on for each supported store
  • It makes configuration easier
    • If a store requires a lot of configuration, including authorisation parameters, these need only be entered once, rather than once for every pipeline component that connects to the store
  • It makes migration between store types easier
    • If a pipeline needs to write to a store of a different type, the administrator simply changes the pipeline configuration to reference a different service, rather than replacing the “write to local file” component with a “write to S3” component

This architecture also supports having multiple stores of the same type: installing more than one service with different configurations allows (say) the use of more than one file system directory or more than one S3 bucket.
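Continuing the hypothetical sketch from earlier, two stores of the same type would simply be two instances of the same service with different configuration, with each pipeline component referencing the instance it needs:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Path;

public class MultipleStoresDemo {
    public static void main(String[] args) throws IOException {
        // Two independent stores of the same type, differing only in
        // configuration; a pipeline component would reference one by name.
        BinaryStore renditions = new FileSystemBinaryStore(Path.of("/data/renditions"));
        BinaryStore scratch = new FileSystemBinaryStore(Path.of("/data/scratch"));

        // Each write lands under that store's own directory.
        scratch.write("hello.txt", new ByteArrayInputStream("hello".getBytes()));
    }
}
```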
