This feature is currently in Beta and only available in a SNAPSHOT version

For this example, we’ll assume that you want to collect data from a file system for a search UI that queries Elastic and provides thumbnails and full-size document renditions for PowerPoint slides and Microsoft Word and PDF documents. The text should be available for searching as quickly as possible, but the renditions can be produced as a background task. The frontend will be responsible for serving the previews and renditions. All we need to do is ensure the path to the rendition is available in the search engine.

As per the architecture above, we’ll end up with two “connectors” and their workflows. One will collect files from the filesystem, extract the text, and publish both to a queue for background processing and to Elasticsearch. The other will read the background processing queue, produce the renditions and publish the updated information to Elasticsearch. As we need to process each binary file twice, we’ll need a binary store: we write to it as we process in the first workflow and read from it in the second, so we can create the thumbnails.

Services

First, you’ll need to install and configure the services we’ll need. These are:

  • A binary store
    • To hold the original binaries so we can process them twice
    • To hold the rendered previews and “originals”
  • A processor for each file type we want to process
    • PPT
    • Word
    • PDF
  • A thumbnail manager
    • To hold the configuration of required renditions
      • Size etc.
    • To route the binaries to the appropriate processor based on mime type


The order in which these services are installed is dictated by the fact that the thumbnail manager needs to be configured with information about the processors and binary store.

Binary Store

Using the instructions here we can install the binary store.

Rendering Engines

With a binary store configured, we can install the three processors using the instructions here. NOTE: the PDF and Word processors require ImageMagick to be installed, and the path to it must be configured correctly.

PowerPoint

Word

PDF

Grouping the Engines in the UI

You may wish to create a group in the services UI to keep all of these engines “in one place”.

Thumbnail Manager

Next, we can add the thumbnail manager. See the instructions here.

Documents

First we configure the document options, including what we'll use as the document id, the content source and the mime type. Pay attention to the mime type field - it's used to route requests to the appropriate engine. If you select the wrong field, your requests may not get routed correctly and you may not get the renditions you expect.
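
For illustration, a file brought in by the connector might arrive as an Aspire document along the lines of the sketch below (the field names match those used by the publishing transforms later on this page; the values are invented for this example). The document id, content source and mime type settings above should point at whichever of your fields carry this information.

<doc>
  <id>/data/docs/presentation.pptx</id>
  <fetchUrl>file:///data/docs/presentation.pptx</fetchUrl>
  <sourceId>filesystem</sourceId>
  <mimeType>application/vnd.openxmlformats-officedocument.presentationml.presentation</mimeType>
</doc>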

Renditions

Add one or more rendition configurations. These are just a “name” and a size. The size can be absolute (in pixels) or a fixed height or width (again in pixels) or a percentage of the original.

The identifiers (IDs) used above can be anything, but care must be taken to ensure the values used here are also used later when configuring the Groovy that posts the information to Elasticsearch (or another search engine).
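
As a sketch only (the sizes here are illustrative, not defaults), the renditions assumed by the rest of this example are:

  • original: 100% of the original page size
  • preview: a fixed width of 200 pixels

The IDs “original” and “preview” are the ones the example Groovy transform later on this page is hard coded to look for.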

Engines

Add the three engines, selecting each in turn from the drop down, and add one or more mime types to be routed to each engine.
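
For reference, the standard mime types for the three formats in this example are:

  • PowerPoint: application/vnd.ms-powerpoint (.ppt) and application/vnd.openxmlformats-officedocument.presentationml.presentation (.pptx)
  • Word: application/msword (.doc) and application/vnd.openxmlformats-officedocument.wordprocessingml.document (.docx)
  • PDF: application/pdf

The values you actually enter should match whatever your connector (and the mime type normaliser, if used) puts in the document, so it is worth checking a few sample documents first.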

By default, the mime type is taken from the “mimeType” in the Aspire document. To change the location, update the configuration in the “document” section.

Binary Store

Finally, configure the binary store, selecting it from the dropdown. You may wish to add a suffix to the content source name extracted from the document, so there’s no chance it “collides” with other sources or with the original binary files.

The Connector

As mentioned in the overview, we want to have a “normal” Aspire workflow that collects the data, extracts the text and sends it to Elastic. We could do this by simply installing our connector and a publisher to Elastic. However, there are two things to bear in mind:

  1. We’ll need to hold a copy of the binary so that we can process it later
  2. We need to “register” the file for background processing

It’s relatively simple to achieve this, but it does require a few more steps in the workflow.

The Workflow

Storing the binary

We’ve already got a binary store installed (in the services above). Installing a “writer” in the workflow will store the binary on the Aspire job, but it’s important to note that the stream is currently being consumed by the text extraction stage built into the connector. Therefore, we must TURN OFF text extraction in the connector, allowing us to store the binary via the writer. However, we still want the text extracted so we can send it to Elastic, so we must add a stage to the workflow after the writer to extract the text (the writer stage does not consume the stream by default, so it remains available to later stages).

Our workflow therefore looks like this:

  1. Write binary to store
  2. Extract text
  3. Publish to Elastic

Installation of the writer is described here. The configuration simply points to the service we added earlier.

Publishing for Background processing

Publishing for background processing is also just a matter of adding another publisher to the workflow. The background queue publisher installation is documented here, and the publisher allows you to configure how much of the document is published into the queue. The minimum we need is the document id and url, but it will also be useful for thumbnail generation if we have the mime type output from text extraction. Fortunately, text extraction gives us the mime type, but to make life easier, we'll normalise the mime types using the mime type normaliser.

Our workflow now looks like this:

  1. Write binary to store
  2. Extract text
  3. Normalise mime types
  4. Publish to background queue
  5. Publish to Elastic

The configuration of the publisher lets us choose the database queue we publish to.

The transform used to publish the information we want to the background queue is shown below:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/doc">
    <doc>
      <!-- ID -->
      <id>
        <xsl:choose>
          <xsl:when test="id">
            <xsl:value-of select="id" />
          </xsl:when>
          <xsl:when test="bgQueueProcessor/id">
            <xsl:value-of select="bgQueueProcessor/id" />
          </xsl:when>
          <xsl:otherwise>
            ID-NOT-PROVIDED
          </xsl:otherwise>
        </xsl:choose>
      </id>

      <!-- Action -->
      <xsl:choose>
        <xsl:when test="action">
          <action><xsl:value-of select="action" /></action>
        </xsl:when>
        <xsl:when test="bgQueueProcessor/action">
          <action><xsl:value-of select="bgQueueProcessor/action" /></action>
        </xsl:when>
        <xsl:otherwise>
          <action>ACTION-NOT-PROVIDED</action>
        </xsl:otherwise>
      </xsl:choose>

      <!-- Source Id -->
      <xsl:if test="sourceId">
        <sourceId><xsl:value-of select="sourceId"/></sourceId>
      </xsl:if>

      <!-- Urls -->
      <xsl:if test="url">
        <url><xsl:value-of select="url"/></url>
      </xsl:if>

      <xsl:if test="fetchUrl">
        <fetchUrl><xsl:value-of select="fetchUrl"/></fetchUrl>
      </xsl:if>

      <xsl:if test="displayUrl">
        <displayUrl><xsl:value-of select="displayUrl"/></displayUrl>
      </xsl:if>
	  
      <xsl:if test="mimeType">
        <mimeType><xsl:value-of select="mimeType"/></mimeType>
      </xsl:if>

      <xsl:if test="normalizedMimeType">
        <normalizedMimeType><xsl:value-of select="normalizedMimeType"/></normalizedMimeType>
      </xsl:if>

      <xsl:if test="normalizedMimeName">
        <normalizedMimeName><xsl:value-of select="normalizedMimeName"/></normalizedMimeName>
      </xsl:if>
    </doc>
  </xsl:template>
</xsl:stylesheet>



The Background Processing

We’re already processing the data and sending the text to Elastic. We have the service ready to generate the renditions and the queue of items needing processing is populated. Now we just need to take items from the queue, ask for the renditions to be created and publish information about the renditions to Elastic.

To read the queue, we install a background queue connector. The installation guide is here. The configuration allows us to choose the queue we read from. Obviously, we need to set this to use the same queue that we configured earlier for the publisher.

The connector can also be configured to run once and exit, or to run continually until stopped by a user.

The Workflow

Reading the Binary

Earlier we wrote the binary of the original file to a store, and we put information in a queue about the item to process. The connector knows the item to process, but the binary is in the store. The first thing we need to do is to read the binary.

The binary store “reader” component (documented here) allows us to do that. The configuration simply requires that you pick the binary store used from the drop down.

Rendering

Once we’ve picked up the binary, we need to send it to the thumbnail manager service we installed above. For this we use the thumbnail component described here. This will connect to the thumbnail manager, send the binary, receive back information about the renditions and write this information back into the Aspire document.

The thumbnail stage only requires us to provide the name of the thumbnail manager (as all the processors and required thumbnails are configured there).
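
To give an idea of the shape of the data, the information written back is a “thumbnails” element containing one child per page (or slide), each of which holds one child per rendition with “id” and “binaryPath” attributes; this is the structure the Groovy stages below rely on. The element names and paths in this sketch are illustrative only:

<thumbnails>
  <page>
    <rendition id="preview" binaryPath="thumbnails/presentation.pptx/1/preview.png"/>
    <rendition id="original" binaryPath="thumbnails/presentation.pptx/1/original.png"/>
  </page>
  <!-- one element like this per page or slide -->
</thumbnails>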

Updating Elastic

Once we’ve produced the thumbnails (and they’ve been written back to the store), we need to update the Elasticsearch index. However, we don’t need to touch Elastic for items that produced no thumbnails, so we’ll add a simple stage that terminates any job that didn’t produce thumbnails. This (custom Groovy) stage has the following code:

// No thumbnail information was added to the document, so there is nothing to update in Elastic
if (doc.thumbnails == null) {
  component.info("Terminating job with no thumbnails: %s", doc.id.text())
  job.terminate()
}

One of the advantages of Elastic is that we can make partial document updates (i.e. just add the thumbnail information to the existing record, rather than replacing the entire index entry). If the engine being used doesn’t have this capability, it’s not a big deal, but you may need to add more (or all) of the metadata to the queue, so that you have the correct metadata in the index after the update has been done.

Publishing to Elastic uses the same publisher component we used to publish the text on the connector workflow. We just use a different transform to send the request we want.

The transform is shown below:

This Groovy script is hard coded to look for the rendition IDs "original" and "preview" created by the examples above. Should you wish to use different (or a different number of) identifiers, you will need to modify the Groovy.

def copyThumbnails(thumbs, previewId, originalId) {
  if (thumbs == null)
    return

  // the outputs
  def previews = []
  def originals = []
  
  // Get the pages
  def pages = thumbs.getChildren();
  if (pages == null)
    return
  
  // process each page
  // process each page
  pages.each() { page ->
    def pageThumbs = page.getChildren()
    if (pageThumbs == null)
      return
    pageThumbs.each() { t ->
      if (t.getAttribute("id") == previewId) {
        previews.push(t.getAttribute("binaryPath"))
      }
      if (t.getAttribute("id") == originalId) {
        originals.push(t.getAttribute("binaryPath"))
      }
    }
  }
  
  builder.previewImages(previews)
  builder.originalImages(originals)
}

//***************************************************
//
// Main routine
//
// Update the Elasticsearch document to add the thumbnails
//

// Action of the job
String action = doc.action.getText();

if ((action == "add") || (action == "update")) {
  /*****************
   * Add or Update *
   *****************/

  // Elasticsearch Header
  builder.update() {
    //Get Type

    '_type' 'aspireDocument'

    '_index' doc.elasticIndex

    // Get ID
    if (!doc.isEmpty("id")) {
      '_id' doc.id
    } else if (!doc.isEmpty("fetchUrl")) {
      '_id' doc.fetchUrl
    } else if (!doc.isEmpty("url")) {
      '_id' doc.url
    } else if (!doc.isEmpty("displayUrl")) {
      '_id' doc.displayUrl
    } else {
      '_id' "ID-NOT-PROVIDED"
    }
  }

  builder.flush()

  // Document Source
  builder.$object() {

    // Need a doc for updates
    builder.doc() {
      // Thumbnails
      // **********************************************************************************************
      // Copy the thumbnail information for the "preview" and "original" renditions. If your renditions
      // are named differently, adjust this code
      // **********************************************************************************************
      copyThumbnails(doc.thumbnails, "preview", "original");

      // Should we update the time somehow?
      // submitTime (new Date());
    }
  }
} else {
  /**********
   * Delete *
   **********/
  // Nothing to do for deleting thumbnails - the regular delete should remove the record
}
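
For an add or update, the transform above produces an Elasticsearch bulk-style partial update: an action line identifying the document, followed by a “doc” object carrying just the new rendition paths. Roughly (the index name and paths are illustrative, and the exact formatting depends on the publisher):

{ "update" : { "_index" : "aspire", "_type" : "aspireDocument", "_id" : "/data/docs/presentation.pptx" } }
{ "doc" : { "previewImages" : [ "thumbnails/presentation.pptx/1/preview.png" ], "originalImages" : [ "thumbnails/presentation.pptx/1/original.png" ] } }

For a delete, the transform emits nothing, as the regular delete from the main workflow removes the whole record.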

The Full Workflow

The full workflow for the background generation of renditions is shown below:

  1. Read from store
  2. Generate thumbnails
  3. Terminate when no thumbnails
  4. Update Elastic

