
For this example, we’ll assume that you want to collect data from a file system for a search UI that queries Elastic and provides thumbnails and full-size document renditions for PowerPoint slides and Microsoft Word and PDF documents. The text should be available for searching as quickly as possible, but the renditions can be produced as a background task. The frontend will be responsible for serving the previews and renditions. All we need to do is ensure the path to the rendition is available in the search engine.

As per the architecture above, we’ll end up with two “connectors”, each with its own workflow. One will collect files from the filesystem, extract the text, and publish both to a queue for background processing and to Elasticsearch. The other will read the background processing queue, produce the renditions and publish the updated information to Elasticsearch. As we need to process each binary file twice, we’ll need a binary store: the first workflow writes to it as it processes, and the second reads from it so we can create the thumbnails.

Services

First, you’ll need to install and configure the required services. These are:

  • A binary store
    • To hold the original binaries so we can process them twice
    • To hold the rendered previews and “originals”
  • A processor for each file type we want to process
    • PPT, Word and PDF
  • A thumbnail manager
    • To hold the configuration of required renditions (size etc)
    • To route the binaries to the appropriate processor based on mime type

The order in which these services are installed matters: the thumbnail manager needs to be configured with information about the processors and the binary store, so install those first.

Using the instructions here we can install the binary store.

With a binary store configured, we can install the three processors using the instructions [here]. NOTE: the PDF and Word processors need ImageMagick to be installed and have the correct path configured.

PowerPoint

Word

PDF

You may wish to create a group in the services section to keep all of these “in one place”.

Next, we can add the thumbnail manager. See the instructions [here].

Add one or more rendition configurations. Each is just a “name” and a size. The size can be absolute (in pixels), a fixed height or width (again in pixels), or a percentage of the original.

[manager-thumbs.png]
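To make the sizing options concrete, here is a hedged Groovy sketch of how a size specification could resolve to output dimensions. The spec format is hypothetical, and it assumes the single-dimension modes preserve the aspect ratio:

    // Hypothetical size resolution, illustrating the three sizing modes above.
    def resolveSize(Map spec, int origW, int origH) {
        if (spec.width && spec.height) {         // absolute size in pixels
            [spec.width, spec.height]
        } else if (spec.width) {                 // fixed width, height scaled to match
            [spec.width, (int) (origH * spec.width / origW)]
        } else if (spec.height) {                // fixed height, width scaled to match
            [(int) (origW * spec.height / origH), spec.height]
        } else {                                 // percentage of the original
            [(int) (origW * spec.percent / 100), (int) (origH * spec.percent / 100)]
        }
    }

    assert resolveSize([width: 100, height: 80], 1000, 800) == [100, 80]
    assert resolveSize([width: 100], 1000, 800)             == [100, 80]
    assert resolveSize([percent: 10], 1000, 800)            == [100, 80]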

Add the three processors, selecting each in turn from the drop-down, and add one or more mime types to be routed to that processor.

 [manager-proc.png]
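Conceptually, this configuration builds a routing table from mime type to processor. As a hedged Groovy sketch (the processor names here are hypothetical):

    // Hypothetical mime-type-to-processor routing, as configured in the manager.
    def processorFor = [
        'application/pdf'              : 'PDF processor',
        'application/msword'           : 'Word processor',
        'application/vnd.ms-powerpoint': 'PowerPoint processor'
    ]

    assert processorFor['application/pdf'] == 'PDF processor'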

By default, the mime type is taken from the “mimeType” element in the Aspire document. To change the location, update the configuration in the “document” section.

[manager-doc.png]

Finally, configure the binary store, selecting it from the drop-down. You may wish to add a suffix to the content source name extracted from the document (say, “filesystem-thumbs” rather than “filesystem”), so there’s no chance it “collides” with other sources or with the original binary files.

[manager-store.png]

The Connector

As mentioned in the overview, we want a “normal” Aspire workflow that collects the data, extracts the text and sends it to Elastic. We could do this by simply installing our connector and a publisher to Elastic. However, there are two things to bear in mind:

1) We’ll need to hold a copy of the binary so that we can process it later

2) We need to “register” the file for background processing

It’s relatively simple to achieve this, but it does require a few more steps in the workflow.

Storing the binary

We’ve already got a binary store installed (above, in the services). Installing a “writer” in the workflow will store the binary carried on the Aspire job, but it’s important to note that the stream is currently being consumed by the text extraction stage built into the connector. Turning that off allows us to store the binary, but we still want the text to send to Elastic, so we must add a stage to the workflow after the writer to extract the text (the writer stage does not, by default, consume the stream, so the text can still be extracted afterwards).
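The underlying constraint is that a content stream can only be read once. A tiny runnable Groovy illustration:

    // A stream is consumed by its first reader: once the text extraction stage
    // has read it, there is nothing left for a later stage to store.
    def stream = new ByteArrayInputStream('hello'.bytes)
    assert stream.text == 'hello'   // the first consumer gets the content
    assert stream.text == ''        // a second read finds the stream exhausted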

Our workflow now becomes:

1) Write binary to store

2) Extract text

3) Publish to Elastic

Installation of the writer is described [here]. The configuration simply points to the service we added earlier.

[config-writer.png]

Publishing for background processing

Publishing for background processing is also just a matter of adding another publisher to the workflow. The background queue publisher installation is documented [here], and the publisher allows you to configure how much of the document is published into the queue. The minimum we need is the document id and url, but it will also be useful for the thumbnail generation if we have the mime type output from text extraction. The transform to save these pieces of information is shown below.

[transform]
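As a hedged illustration of what the transform keeps (not the transform itself), the queue entry amounts to the following Groovy map; the field names are assumptions based on the description above:

    // Hypothetical minimal queue entry: just enough to find the item again
    // and to route it to the right processor.
    def makeQueueEntry(Map aspireDoc) {
        [
            id      : aspireDoc.id,        // used later to update the same Elastic record
            url     : aspireDoc.url,       // locates the original in the binary store
            mimeType: aspireDoc.mimeType   // lets the thumbnail manager pick a processor
        ]
    }

    assert makeQueueEntry(id: '42', url: '/share/report.pdf', mimeType: 'application/pdf') ==
           [id: '42', url: '/share/report.pdf', mimeType: 'application/pdf']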

The configuration of the publisher lets us choose the database queue we publish to.

[config-queue-publisher.png]

So now our workflow looks like this:

1) Write binary to store

2) Extract text

3) Publish to background queue

4) Publish to Elastic

[workflow-connector.png]

The background processing

We’re already processing the data and sending the text to Elastic. We have the service ready to generate the renditions and the queue of items needing processing is populated. Now we just need to take items from the queue, ask for the renditions to be created and publish information about the renditions to Elastic.

To read the queue, we install a background queue connector. The installation guide is [here]. The configuration allows us to choose the queue we read from; obviously, this must be set to the same queue we configured for the publisher earlier.

[config-connector.png]

The connector can also be configured to run once and exit, or to run continuously until stopped by a user.

The workflow

Earlier, we wrote the binary of the original file to a store and put information about the item to process in a queue. The connector knows which item to process, but the binary is in the store. The first thing we need to do is read the binary.

The binary store “reader” component – documented [here] – allows us to do that. The configuration simply requires that you pick the binary store used from the drop-down.

[bg-connecctor-config.png]

Once we’ve picked up the binary, we need to send it to the thumbnail manager service we installed above. For this we use the thumbnail component described [here]. This will connect to the thumbnail manager, send the binary, receive back information about the renditions and write this back into the Aspire document.

The thumbnail stage only requires us to provide the name of the thumbnail manager (as all the processors and required thumbnails are configured there).

[thumbnail-cfg.png]

Once we’ve produced the thumbnails (and they’ve been written back to the store), we need to update the Elasticsearch index. However, we don’t need to touch Elastic for items that produced no thumbnails, so we’ll add a simple stage that terminates any job that didn’t produce thumbnails. This (custom Groovy) stage has the following code:

[code]
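A minimal sketch of such a stage, assuming the thumbnail component writes its rendition information under a “thumbnails” element and that the standard “doc” and “job” variables of an Aspire Groovy stage are in scope:

    // Terminate jobs that produced no thumbnails, so they never reach the
    // Elastic update stage. The "thumbnails" element name is an assumption;
    // adjust it to wherever your thumbnail component writes rendition info.
    if (doc.get("thumbnails") == null) {
        job.terminate()
    }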

One of the advantages of Elastic is that we can make partial document updates (i.e. just add the thumbnail information to the existing record, rather than replacing the entire index entry). If the engine being used doesn’t have this capability, it’s not a big deal, but you may need to add more (or all) of the metadata to the queue, so you have the correct metadata in the index after the update has been done.
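As a hedged sketch (the field names, paths and index layout are illustrative, not what the publisher actually emits), building such a partial-update body in Groovy looks like this:

    import groovy.json.JsonOutput

    // A partial update sends only the changed fields; Elasticsearch merges
    // them into the existing record rather than replacing it.
    def updateBody = JsonOutput.toJson([
        doc: [
            thumbnails: [
                [name: 'small', path: '/binarystore/filesystem-thumbs/42-small.png'],
                [name: 'large', path: '/binarystore/filesystem-thumbs/42-large.png']
            ]
        ]
    ])

    // POSTed to the document's _update endpoint; the exact URL depends on the
    // Elasticsearch version in use.
    println updateBody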

Publishing to Elastic uses the same publisher component we used to publish the text on the connector workflow. We just use a different transform to send the request we want.

[bg-elastic.png]

The transform is shown below:

[elastic-update-transform]

The full pipeline for the background processing is now:

1) Read from store

2) Generate thumbnails

3) Terminate when no thumbnails

4) Update Elastic

[bg-pipeline.png]
