This feature is currently in Beta and only available in a SNAPSHOT version

When a user performs a search, they expect to get relevant results that give the title, authors, dates and so on, together with a snippet of the document. The snippet is there to give the context of the hit in the document – to help you select the document you're likely to be most interested in. However, the snippet will normally be just text and may well have lost much of the visual context, especially if the document was a PowerPoint or a PDF with lots of pictures. If we want to enhance the user's search experience, one of the simplest things we can do is give a pictorial representation of the document or page (slide), so that the user has more information on which to base their choice.

Presenting a pictorial representation of the page is not a particularly difficult task to imagine. We just need to render that page (slide) at an appropriate resolution and present it through the UI. But rendering a page (or document) is processor intensive, so we wouldn't want to do it at search time, otherwise the UI would be too slow. We'd therefore need to do it at index time – when we don't yet know which document or page we'll need. Now our simple "render a page" has become "render all pages of all documents at index time and store them". That still doesn't seem too complex, but it doesn't come without issues. As mentioned, rendering pages (and now whole documents) is a processor-intensive task. If we put it into an Aspire pipeline before sending the document to the search engine, it could increase the latency of the ingestion process to an unacceptable level. Fortunately, we can mitigate this with Aspire's background processing. We also need to be able to store the renditions of all those pages. Again, we have a solution – Aspire's Binary Store.

Thus, we're starting to build up our solution. We need a connector to get the documents, process them "as normal" and publish them to the search engine. In the background, we'll create our renditions and send updates to the search engine once we know the names of the files containing them. Add in the fact that we need to store the original binary (before processing) so that we can perform text extraction and produce the renditions, and we have the architecture shown below.
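To make the flow concrete, here is a minimal sketch of the index-time processing just described. It is illustrative only – the type and method names (BinaryStore, RenderQueue, SearchEngine and so on) are assumptions for the sake of the example and are not the actual Aspire APIs.

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

// Hypothetical interfaces standing in for the real components.
interface BinaryStore   { Path save(String docId, InputStream original) throws Exception; }
interface TextExtractor { String extract(Path stored) throws Exception; }
interface SearchEngine  { void index(String docId, String text);
                          void update(String docId, Map<String, List<String>> fields); }
interface RenderQueue   { void submit(String docId, Path stored); }

public class IngestionPipeline {
    private final BinaryStore store;
    private final TextExtractor extractor;
    private final SearchEngine engine;
    private final RenderQueue renderQueue;

    public IngestionPipeline(BinaryStore store, TextExtractor extractor,
                             SearchEngine engine, RenderQueue renderQueue) {
        this.store = store;
        this.extractor = extractor;
        this.engine = engine;
        this.renderQueue = renderQueue;
    }

    public void ingest(String docId, InputStream original) throws Exception {
        Path binary = store.save(docId, original);       // keep the original binary for later rendering
        engine.index(docId, extractor.extract(binary));  // process "as normal" and publish to the search engine
        renderQueue.submit(docId, binary);               // render pages in the background, not at ingestion time
    }

    // Called by the background worker once the renditions have been produced
    // and the names of the files containing them are known.
    public void onRenditionsReady(String docId, List<String> renditionFiles) {
        engine.update(docId, Map.of("page_renditions", renditionFiles));
    }
}
```

The key point of the sketch is that ingestion only queues the rendering work; the search engine document is updated with the rendition file names later, once the background processing has finished.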

Handling Differing Document Types

When rendering the documents, it's preferable that we can cope with a changing set of document types. Initially we may only want to render Microsoft Office documents (or a subset of them). Later we may want to render PDF and HTML as well. Our architecture should therefore allow for this. It would also be a good idea if we could "choose" what causes a document to be rendered by a particular processor and how it is rendered (resolution etc.) without having to find lots of workflow components and change configuration in multiple places. For that reason, we will add a "manager" that holds the configuration for the renditions (resolution etc.) and allows us to add "routes" to one or more "rendition producers" based on certain criteria. Our architecture is now as shown below.
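As a rough illustration of this routing idea, the sketch below shows a manager that holds the rendition configuration (here just the resolution in DPI) and routes each document to a producer based on its MIME type. Again, the names (RenditionManager, RenditionProducer, Route) are hypothetical and not the actual Aspire components.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical producer: renders a stored binary into one or more image files.
interface RenditionProducer {
    List<Path> render(Path binary, int dpi) throws Exception;
}

public class RenditionManager {

    // A route pairs a matching rule (here, on MIME type) with a producer and its
    // rendition settings, so all of the configuration lives in one place.
    private record Route(Predicate<String> matchesMimeType, RenditionProducer producer, int dpi) {}

    private final List<Route> routes = new ArrayList<>();

    public void addRoute(Predicate<String> matchesMimeType, RenditionProducer producer, int dpi) {
        routes.add(new Route(matchesMimeType, producer, dpi));
    }

    public List<Path> render(String mimeType, Path binary) throws Exception {
        for (Route route : routes) {
            if (route.matchesMimeType().test(mimeType)) {
                return route.producer().render(binary, route.dpi());
            }
        }
        return List.of();   // no producer configured for this type: nothing to render
    }
}

// Example configuration: Office documents and PDFs use different producers, and a
// new document type can be supported later simply by registering an extra route.
// manager.addRoute(m -> m.startsWith("application/vnd."), officeProducer, 96);
// manager.addRoute(m -> m.equals("application/pdf"), pdfProducer, 150);
```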
