Prerequisites
- Familiarity with Amazon S3. You should at least have created a bucket and added metadata to a document. If not, see the Amazon S3 Documentation.
- Familiarity with Amazon CloudSearch. If not, see the Amazon CloudSearch Documentation.
- Installed Aspire Amazon S3 connector.
- Installed Aspire Amazon CloudSearch publisher.
Step 3: Map custom fields in the publisher XSL
Download AspireToCloudSearch.xsl. This is the default XSL used to publish to CloudSearch.
(1.2 Release) The default transformation XSL file provided by the publisher expects metadata as described in Connector AspireObject Metadata.
To map custom fields to Amazon CloudSearch, add the following section to the <add> element in the XSL (feel free to remove any unnecessary fields):
<xsl:if test="connectorSpecific/field[@name = 'field1']">
  <field name="field1"><xsl:value-of select="connectorSpecific/field[@name = 'field1']" /></field>
</xsl:if>
<xsl:if test="connectorSpecific/field[@name = 'field2']">
  <field name="field2"><xsl:value-of select="connectorSpecific/field[@name = 'field2']" /></field>
</xsl:if>
The same technique may be used for further custom fields.
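To make the effect of the xsl:if mapping concrete, here is a minimal Python sketch that applies the same lookup to a sample AspireObject. It is only an illustration, not part of the publisher: the sample document, its field values, and the map_custom_fields helper are hypothetical, and the shape of the connectorSpecific element is inferred from the XPath used in the XSL.

```python
import xml.etree.ElementTree as ET

# Hypothetical AspireObject fragment; field names and values are illustrative.
SAMPLE_DOC = """
<doc>
  <connectorSpecific>
    <field name="field1">jazz</field>
    <field name="unmapped">ignored</field>
  </connectorSpecific>
</doc>
"""


def map_custom_fields(doc, wanted):
    """Return {name: value} for each wanted field that exists in the document."""
    mapped = {}
    for name in wanted:
        # Same lookup as the XPath connectorSpecific/field[@name = '...'].
        node = doc.find(f"connectorSpecific/field[@name='{name}']")
        if node is not None:  # mirrors <xsl:if>: missing fields are skipped
            mapped[name] = "".join(node.itertext())
    return mapped


doc = ET.fromstring(SAMPLE_DOC)
print(map_custom_fields(doc, ["field1", "field2"]))  # {'field1': 'jazz'}
```

field2 is absent from the sample, so nothing is emitted for it, and the "unmapped" field is dropped because no mapping asks for it. The XSL behaves the same way: only fields that have an xsl:if block reach CloudSearch.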
Step 4: Select CloudSearch document ID
CloudSearch requires that you define a field to be used as document ID, which uniquely identifies each document.
For this tutorial, the MD5 of the document URL is used as the document ID.
The MD5 is available at "doc/ancestorInformation/itemId/@md5" in each AspireObject.
Example AspireObject:
<doc>
  ...
  ...
  <ancestorInformation>
    <parentId md5="2DAC897C070CEC260E1746EDB3873736">my-first-s3-bucket-1-0000000001/Work</parentId>
    <itemName>Songs.txt</itemName>
    <ancestorsDisplay>my-first-s3-bucket-1-0000000001/Work</ancestorsDisplay>
    <itemType>text/plain</itemType>
    <ancestors>
      <ancestorId md5="2DAC897C070CEC260E1746EDB3873736">my-first-s3-bucket-1-0000000001/Work</ancestorId>
      <ancestorId md5="26A73A8B9D65D99DA25FF917AF14D7E6">my-first-s3-bucket-1-0000000001</ancestorId>
    </ancestors>
    <ancestorTypes>BUCKET\\FOLDER</ancestorTypes>
    <itemLevel>3</itemLevel>
    <itemId md5="36FAF68F40825849DE2DFD589C3A2D8F">my-first-s3-bucket-1-0000000001/Work/Songs.txt</itemId>
  </ancestorInformation>
  <crawlId>10</crawlId>
</doc>
Update the XSL accordingly by inserting the following right below the "add" element:
<xsl:variable name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>
<xsl:variable name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>
<xsl:attribute name="id">
  <xsl:variable name="toconvert"><xsl:value-of select="ancestorInformation/itemId/@md5"/></xsl:variable>
  <xsl:value-of select="translate($toconvert,$ucletters,$lcletters)"/>
</xsl:attribute>
The translate() function converts the ID to lower case, because the connector emits the MD5 in upper case and CloudSearch requires IDs to consist of lower-case letters and numbers only.
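The ID derivation can be checked outside the XSL with a few lines of Python. This is a sketch under stated assumptions: the sample is trimmed from the example AspireObject (the itemId text is shortened), cloudsearch_id is a hypothetical helper, and the regular expression encodes only the rule stated in this tutorial (lower-case letters and numbers), not the full CloudSearch specification.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed from the example AspireObject; the itemId text is shortened here.
SAMPLE_DOC = """
<doc>
  <ancestorInformation>
    <itemName>Songs.txt</itemName>
    <itemId md5="36FAF68F40825849DE2DFD589C3A2D8F">Work/Songs.txt</itemId>
  </ancestorInformation>
</doc>
"""

# The rule as stated in this tutorial: lower-case letters and numbers only.
ID_PATTERN = re.compile(r"[a-z0-9]+")


def cloudsearch_id(doc):
    """Read ancestorInformation/itemId/@md5 and lower-case it, like translate()."""
    item_id = doc.find("ancestorInformation/itemId")
    if item_id is None or not item_id.get("md5"):
        raise ValueError("AspireObject has no itemId/@md5 to use as document ID")
    doc_id = item_id.get("md5").lower()
    if not ID_PATTERN.fullmatch(doc_id):
        raise ValueError(f"not a valid document ID: {doc_id!r}")
    return doc_id


print(cloudsearch_id(ET.fromstring(SAMPLE_DOC)))  # 36faf68f40825849de2dfd589c3a2d8f
```

Running a check like this against a few real AspireObjects before publishing is a cheap way to confirm that every document actually carries the md5 attribute.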
Step 5: Select CloudSearch version field
CloudSearch also requires that you define a version field, which increases as new document versions are submitted.
The field selected in this tutorial is crawlId, which is added to each AspireObject created by the S3 connector (when running from a content source in the Admin UI). The crawl ID increases every time you run a content source, whether by clicking "Full" or "Update" or from a schedule.
To update the XSL, add the following right below the "add" element:
<xsl:attribute name="version"><xsl:value-of select="crawlId" /></xsl:attribute>
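The following Python sketch shows why an increasing crawl ID works as a version. It rests on an assumption that is not stated in this tutorial: that CloudSearch applies an update only when its version is higher than the version already stored. Both helper functions and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed from the example AspireObject: only the crawlId element is kept.
SAMPLE_DOC = "<doc><crawlId>10</crawlId></doc>"


def cloudsearch_version(doc):
    """Read crawlId as an integer, as the XSL does with select="crawlId"."""
    crawl_id = doc.findtext("crawlId")
    if crawl_id is None:
        raise ValueError("AspireObject has no crawlId; was it run from a content source?")
    return int(crawl_id)


def would_replace(stored_version, incoming_version):
    """Assumption: an update wins only if its version is strictly higher."""
    return incoming_version > stored_version


version = cloudsearch_version(ET.fromstring(SAMPLE_DOC))
print(version)                     # 10
print(would_replace(version, 11))  # True: the next crawl supersedes this one
print(would_replace(version, 10))  # False: same crawl ID, not a newer version
```

Under that assumption, re-submitting a document with the same crawl ID would not overwrite the stored copy, which is why the version must come from a value that grows on every run.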
Summary
- There is no need to remove the “x-amz-meta-” prefix, as the connector removes it automatically.
- Make sure the “results” column is turned on for your CloudSearch fields. Otherwise, they can be searched but will not be returned in the search results.
- Map every AspireObject field you need to a CloudSearch field. Extend the XSL to add as many fields as required.
- You need a document version. The crawl ID is a good choice because it is incremented on each run of the same content source.
- You need a document ID. The MD5 of the document URL is a good choice because it uniquely identifies each document.