The following is a quick tutorial on how to map custom metadata fields, extract them using the Aspire S3 connector, and send the data to Amazon CloudSearch.

Prerequisites

  1. Familiarity with Amazon S3. You should at least have created a bucket and added document metadata. If not, take a look at the Amazon S3 Documentation.
  2. Familiarity with Amazon CloudSearch. Check the Amazon CloudSearch Documentation if not.
  3. The Aspire Amazon S3 connector installed.
  4. The Aspire Amazon CloudSearch publisher installed.

Step 1: Create custom metadata fields in an S3 bucket

Create two custom metadata fields, “x-amz-meta-field1” and “x-amz-meta-field2” in a sample S3 bucket.

No mapping or clean-up is necessary to remove the “x-amz-meta-” prefix from custom metadata fields; the Aspire S3 connector removes it by default.
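For illustration, here is one way to set such metadata with the AWS CLI when uploading an object (the bucket and key names here are hypothetical; any S3 client that sets user metadata works the same way). S3 stores each entry as an “x-amz-meta-*” header, and the connector exposes it with the prefix stripped:

  # Hypothetical bucket/key; S3 prepends "x-amz-meta-" to each user metadata key.
  aws s3api put-object --bucket my-first-s3-bucket --key Test/ReadMe.txt \
    --body ReadMe.txt --metadata field1=test1,field2=test2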

After crawling with the S3 connector, the metadata will look like this in the AspireObject:

 <doc>
    <fetchUrl>Test/ReadMe.txt</fetchUrl>
    <docType>FILE</docType>
    <lastModified>2012-11-15T17:27:58Z</lastModified>
    <dataSize>316</dataSize>
    <contentType source="ExtractTextStage/Content-Type">text/plain</contentType>
    <crawlId>10</crawlId>
    <content source="ExtractTextStage">Read me file content</content>
    <connectorSpecific type="s3">
      <field name="ETag">cf31821f5aa14facf1bcd714eeb31f6a</field>
      <field name="Content-Length">316</field>
      <field name="Last-Modified">Thu Nov 15 11:27:58 CST 2012</field>
      <field name="Accept-Ranges">bytes</field>
      <field name="Content-Type">text/rtf</field>
      <field name="field1">test1</field>
      <field name="field2">test2</field>
    </connectorSpecific>
    <snapshotUrl>003 my-first-s3-bucket-1-0000000001/Test/ReadMe.txt</snapshotUrl>
    <action>add</action>
    ...
    ...
  </doc>

See S3 Application Bundle connector documentation for more details.


Create two custom metadata fields

Step 2: Define custom fields in Amazon CloudSearch

The fields in Amazon CloudSearch are defined as follows:

Both fields are of type “text” and have the “result” column checked so that they can be seen in the search results.

Define two custom metadata fields
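If you prefer the command line, a recent AWS CLI can define equivalent fields (a sketch only, and the domain name “my-search-domain” is an assumption; depending on your CloudSearch API version you may still need to enable the result option in the console):

  # Sketch: define field1 and field2 as text index fields in the search domain.
  aws cloudsearch define-index-field --domain-name my-search-domain --name field1 --type text
  aws cloudsearch define-index-field --domain-name my-search-domain --name field2 --type text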

Step 3: Map custom fields in the publisher XSL

Download File:AspireToCloudSearch.xsl. This is the default XSL used by Publish to CloudSearch.


 (1.2 Release)   The default transformation XSL file provided by the publisher expects metadata as described in Connector AspireObject Metadata.


To map custom fields to Amazon CloudSearch, add the following section to the <add> element in the XSL (feel free to remove any unnecessary fields):

  <xsl:if test="connectorSpecific/field[@name = 'field1']">
     <field name="field1"><xsl:value-of select="connectorSpecific/field[@name = 'field1']" /></field>
  </xsl:if>

  <xsl:if test="connectorSpecific/field[@name = 'field2']">
     <field name="field2"><xsl:value-of select="connectorSpecific/field[@name = 'field2']" /></field>
  </xsl:if>

The same technique may be used for further custom fields.
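If you would rather not enumerate each field by hand, a generic loop can copy every connector-specific field (a sketch; note that it would also copy fields such as “ETag” and “Content-Length”, so every field it emits must be defined in your CloudSearch domain or the batch will be rejected):

  <xsl:for-each select="connectorSpecific/field">
     <field name="{@name}"><xsl:value-of select="." /></field>
  </xsl:for-each>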


Step 4: Select CloudSearch document ID

CloudSearch requires that you define a field to be used as document ID, which uniquely identifies each document.

For this tutorial, the MD5 of the document URL is used as the document ID.

You can find the MD5 field at “doc/ancestorInformation/itemId/@md5” of each AspireObject.

Example AspireObject:

  <doc>
     ...
     ...
    <ancestorInformation>
      <parentId md5="2DAC897C070CEC260E1746EDB3873736">my-first-s3-bucket-1-0000000001/Work</parentId>
      <itemName>Songs.txt</itemName>
      <ancestorsDisplay>my-first-s3-bucket-1-0000000001/Work</ancestorsDisplay>
      <itemType>text/plain</itemType>
      <ancestors>
        <ancestorId md5="2DAC897C070CEC260E1746EDB3873736">my-first-s3-bucket-1-0000000001/Work</ancestorId>
        <ancestorId md5="26A73A8B9D65D99DA25FF917AF14D7E6">my-first-s3-bucket-1-0000000001</ancestorId>
      </ancestors>
      <ancestorTypes>BUCKET\\FOLDER</ancestorTypes>
      <itemLevel>3</itemLevel>
      <itemId md5="36FAF68F40825849DE2DFD589C3A2D8F">my-first-s3-bucket-1-0000000001/Work/Songs.txt</itemId>
    </ancestorInformation>
    <crawlId>10</crawlId>
  </doc>


Update the XSL accordingly by inserting the following right below the <add> element:

  <xsl:variable name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>
  <xsl:variable name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>
  <xsl:attribute name="id">
     <xsl:variable name="toconvert"><xsl:value-of select="'''ancestorInformation/itemId/@md5'''"/></xsl:variable>
     <xsl:value-of select="translate($toconvert,$ucletters,$lcletters)"/>
  </xsl:attribute>


The translate() function is used to convert the ID to lower case; CloudSearch requires that IDs contain only lower-case letters and numbers.
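For example, the itemId MD5 from the sample above would be converted as follows:

  36FAF68F40825849DE2DFD589C3A2D8F  →  36faf68f40825849de2dfd589c3a2d8f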


Step 5: Select CloudSearch version field

CloudSearch also requires that you define a version field, which increases as new document versions are submitted.

The field selected in this tutorial is crawlId, which is added to each AspireObject created by the S3 connector (when running from a content source in the Admin UI). The crawl ID is incremented every time you run a content source (by clicking “Full”, “Update”, or from a schedule).

To update the XSL, add the following right below the <add> element:

<xsl:attribute name="version"><xsl:value-of select="crawlId" /></xsl:attribute>

Step 6: Update the XSL used by the Publish to CloudSearch application

To update the XSL used by your Publish to CloudSearch application, do the following:

  1. Copy the modified XSL to ${aspire.home}/config/xsl/ folder in your Aspire distribution (where you are running Aspire from). Download File:CustomMetadata-AspireToCloudSearch.xsl to see the final version of the XSL.
  2. Go to the Aspire Home and update your installed Publish to CloudSearch application.
    1. Click on the Content Source.
    2. Click on the Workflow tab and drag and drop or double-click on the existing Publish to CloudSearch application.
    3. Enter the path to the updated XSL. This path is relative to ${aspire.home}, so there is no leading “/” character. You can use the same path, “config/xsl/AspireToCloudSearch.xsl”.
    4. Click on the “Add” button.

After that, run your S3 connector source and you should see the custom fields on CloudSearch.

2-1. Click on the Content Source

2-2. Drag and drop the Publish To CloudSearch on the onPublish Event

2-3. Enter the updated XSL File Path

Handling deletes

To handle deletes, install a scripting application to encode the document URL to MD5. The AspireObject for deletes is slightly different and does not contain the MD5 by default. Example:

  <doc>
     <folder>Test/1995/ed465240.pdf</folder>
     <fetchUrl>my-s3-bucket-1-0000000001/Test/1995/ed465240.pdf</fetchUrl>
     <docType>FILE</docType>
     <snapshotUrl>004 my-s3-bucket-1-0000000001/Test/1995/ed465240.pdf</snapshotUrl>
     <bucket>my-s3-bucket-1-0000000001</bucket>
     <action>delete</action>
     <connectorSource type="s3">
        <parentUrl/>
        <displayName>TestS3</displayName>
        <indexFolders>true</indexFolders>
        <scanSubFolders>true</scanSubFolders>
     </connectorSource>
     <crawlId>21</crawlId>
  </doc>


Adding the scripting application

  1. Go to the Content Source configuration and click on the workflow tab.
  2. Click on Base Functions and drag and drop the Custom (Custom Groovy Script) application to the OnPublish event.
  3. Edit the description of the Groovy script.
  4. In the Script text field, enter the Groovy script below, then click on the “Add” button:
  import com.searchtechnologies.aspire.framework.utilities.StringUtilities;

  if (doc.getText("action") == "delete") {
     doc.add("md5", StringUtilities.stringToMD5(doc.getText("fetchUrl")));
  }

  • The import is required to be able to use the class StringUtilities.
  • The method doc.getText will read a direct child of the AspireObject.
  • The <doc><action/></doc> element will always be “delete” when a delete was detected in S3. We only want to do something special in those situations.
  • The method doc.add will add a new child to the AspireObject. In this case, named “md5”.
  • The method StringUtilities.stringToMD5 converts a string to its MD5 value.
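
After the script runs, the delete AspireObject carries the new child (the MD5 value below is illustrative, not computed from the sample URL):

  <doc>
     <fetchUrl>my-s3-bucket-1-0000000001/Test/1995/ed465240.pdf</fetchUrl>
     <action>delete</action>
     <md5>0F6E6E99D6EC2A0FDE382F33BD424C52</md5>
     ...
  </doc>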

Make sure you add the new scripting into the OnPublish event. In this example, the scripting application is named “HandleDeletes”. Place it before your PublishToCloudSearch application.
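
On the XSL side, the delete branch can then build the CloudSearch delete operation from this new field. The downloadable File:CustomMetadata-AspireToCloudSearch.xsl shows the final version; a sketch of what it might look like, assuming the $lcletters/$ucletters variables from Step 4 are in scope:

  <xsl:if test="action = 'delete'">
     <delete>
        <xsl:attribute name="id"><xsl:value-of select="translate(md5,$ucletters,$lcletters)"/></xsl:attribute>
        <xsl:attribute name="version"><xsl:value-of select="crawlId"/></xsl:attribute>
     </delete>
  </xsl:if>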

After this, you will start seeing deletions in CloudSearch. More documentation on Amazon CloudSearch document deletion is available here. Finally, all this customization is required because Amazon CloudSearch fields are not predefined; the XSL customization has to be done for each set of CloudSearch index fields you have.

Workflow Tab

Install the Scripting App

Enter the groovy script

Save it

Summary

  • There is no need to remove the “x-amz-meta-” prefix, as that is automatically removed by the connector.
  • You have to make sure your CloudSearch fields have the “result” column checked. Otherwise, they won’t show up in the search results and will be used for search only.
  • You need to map any AspireObject fields you need to CloudSearch. Extend the XSL as needed to add as many fields as you need.
  • You need a document version, e.g., select the crawl ID, as this is incremented after each run of the same connector source.
  • You need a document ID, e.g., select the MD5 of the document URL as it uniquely identifies your documents.