The Aspire Google Cloud Search Publisher is made up of two main sets of components: the publisher configuration components and the transformer components.

Complete the following steps in order to configure the Google Cloud Search Publisher.

Step 1. Launch Aspire and open the Content Source Management page

Launch Aspire (if it's not already running). See:

Step 2. Add a new content source

For this step, please follow the Configuration Tutorial of the connector of your choice. Please refer to the Connector list.


Step 3. Add a transform stage to the workflow

A transform stage will need to create Aspire objects that have the structure described below. These Aspire objects will be used as input to the GCS publisher. There are two mechanisms for creating a transform stage:

  • Create a custom Groovy script
  • Create a custom Java component 

Output Aspire Object from the Transform / Input Aspire Object to the GCS Publisher

The Google Cloud Search Publisher expects an incoming Aspire Object that has these fields:

Field | Description

doc.qid (required)

The unique id for the document

doc.isPublic (optional)

If true, the ACLs on the item will be set to viewable by the entire G-Suite domain

doc.hierarchy (Required if using ACL Inheritance or Document Hierarchy, otherwise optional)

Supplies the following values:
  • The Google Cloud Search document id (doc.hierarchy.item[@id])
  • The document's parent id (doc.hierarchy.ancestors.ancestor[where @level == doc's @level - 1][@id])

doc.content

The content of the document in text format.

  • Set to NO-CONTENT if the document has no content.

doc.structuredData (required)

The fields that make up the Google Cloud Search document's structured data as specified in the Google Cloud Search schema.

The following property types are supported. The conversion to the appropriate Google Cloud Search property type is automatically performed if the value meets the specified criteria:

  • text - if the Aspire property holds a String value
  • date - if the Aspire property holds a java.util.Date value
  • timestamp - if the Aspire property holds a java.time.Instant value
  • integer - if the Aspire property holds an Integer or Long value
  • double - if the Aspire property holds a Double value
  • Boolean - if the Aspire property holds a Boolean value

Please note that, with the exception of Boolean properties, all properties can be multi-valued. If an Aspire property in doc.structuredData holds a value of any type other than those specified above, the property will not be added to the document in the index.
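
The type rules above can be sketched in plain Java. This is only an illustration of the mapping described in the list, not the publisher's actual code: the class and method names are hypothetical, and the handling of mixed-type lists and of multi-valued booleans (both treated as unsupported here) is an assumption.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper, not part of Aspire: names the property type a value would map to. */
class PropertyTypeSketch {

    /** Maps a single value to a property type name, or empty if the type is unsupported. */
    static Optional<String> scalarTypeOf(Object value) {
        if (value instanceof String) return Optional.of("text");
        if (value instanceof Date) return Optional.of("date");
        if (value instanceof Instant) return Optional.of("timestamp");
        if (value instanceof Integer || value instanceof Long) return Optional.of("integer");
        if (value instanceof Double) return Optional.of("double");
        if (value instanceof Boolean) return Optional.of("boolean");
        return Optional.empty();
    }

    /**
     * Handles single and multi-valued properties.
     * Assumption: a list must hold one supported type throughout, and a
     * boolean list with more than one element is treated as unsupported.
     */
    static Optional<String> typeOf(Object value) {
        if (!(value instanceof List)) return scalarTypeOf(value);
        List<?> values = (List<?>) value;
        if (values.isEmpty()) return Optional.empty();
        Optional<String> first = scalarTypeOf(values.get(0));
        if (!first.isPresent()) return Optional.empty();
        if (first.get().equals("boolean") && values.size() > 1) return Optional.empty();
        for (Object element : values) {
            if (!scalarTypeOf(element).equals(first)) return Optional.empty();
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(typeOf(Arrays.asList("Johnny Depp", "Juliette Binoche"))); // Optional[text]
        System.out.println(typeOf(5));    // Optional[integer]
        System.out.println(typeOf(2.1f)); // Optional.empty (Float is not in the supported list)
    }
}
```

A check like this in the transform stage makes it easier to spot properties that would otherwise be silently left out of the index.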

doc.enumsData (optional)

Holds properties that will be loaded in the Google Cloud Search structured metadata as enum fields.

doc.contentFormat (optional)

TEXT or RAW; overrides the config in the publisher.

doc.content (optional)

Holds the text when the contentFormat, set either in the Aspire object or in the publisher config, is TEXT.

job.variable['contentStream'] (optional)

Holds the input stream to be used for sending the content in RAW format.

job.variable['contentBytes'] (optional)

Holds the byte array to be used for sending the content in RAW format. The contentStream takes precedence over the contentBytes if both are set.

doc.acls (optional)

Holds the document's ACL information in the typical format produced by Aspire connectors.

Additional notes:

    • the parentAcl and acl fields will be combined into a single list when sent to Google Cloud Search if the aclInheritanceType is NOT_APPLICABLE
    • for any aclInheritanceType other than NOT_APPLICABLE, only the "acl" will be loaded
    • for Aspire principal names constructed with # inside them, only the part after the last # will be sent as the principal name to Google Cloud Search
    • user principal type - processing rules:
      • if the acl item has a principalType attribute with a value of CloudIdentity, the acl will be added with a userResourceName or groupResourceName
      • if the acl item has a principalType attribute with a value of GSuitePrincipal, the acl will be added with a G-Suite principal
      • if there is no principalType attribute that matches the CloudIdentity or GSuitePrincipal values, the principal name matches an email regex pattern, and the "ACL email entities as GSuite Principals" setting is set to true, the principal will be sent as a GSuite principal
      • otherwise the principal will be considered a Cloud Identity principal (i.e. the userResourceName or groupResourceName will be set)
    • an acl or parentAcl object can have an identitySourceId attribute, which will override the identity source specified in the publisher configuration or at the Aspire Object level

For more info on the structure of an ACL in Google Cloud Search, please visit this link.
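
The principal name and principal type rules listed in the notes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the publisher's implementation: the class, enum, and method names are hypothetical, the email pattern is a deliberately simple stand-in for whatever regex the publisher uses, and applying the pattern to the name after the last # is also an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Aspire: mirrors the documented principal rules. */
class AclPrincipalSketch {

    enum Kind { CLOUD_IDENTITY, GSUITE_PRINCIPAL }

    // Assumption: a simple email pattern; the publisher's real regex is not documented here.
    private static final Pattern EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    /** Keeps only the part after the last '#'; names without '#' are returned unchanged. */
    static String principalName(String aspireName) {
        return aspireName.substring(aspireName.lastIndexOf('#') + 1);
    }

    /**
     * Resolves the principal kind.
     * An explicit principalType wins; otherwise email-like names become G-Suite
     * principals only when the "ACL email entities as GSuite Principals" setting is on.
     */
    static Kind resolve(String principalType, String aspireName, boolean emailsAsGSuitePrincipals) {
        if ("CloudIdentity".equals(principalType)) return Kind.CLOUD_IDENTITY;
        if ("GSuitePrincipal".equals(principalType)) return Kind.GSUITE_PRINCIPAL;
        boolean isEmail = EMAIL.matcher(principalName(aspireName)).matches();
        return (isEmail && emailsAsGSuitePrincipals) ? Kind.GSUITE_PRINCIPAL : Kind.CLOUD_IDENTITY;
    }

    public static void main(String[] args) {
        String group = "421aa90e079fa326b6494f812ad13e79#confluence-prefix#employees@cagsearchdemo.com";
        System.out.println(principalName(group));          // employees@cagsearchdemo.com
        System.out.println(resolve(null, group, true));    // GSUITE_PRINCIPAL
        System.out.println(resolve("CloudIdentity", group, true)); // CLOUD_IDENTITY
    }
}
```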

doc.action (required)

Add, update or delete.

  • Please note that an object with action=crawlStart and attribute full=true will trigger the Publisher to switch to ASYNCHRONOUS mode for all subsequent indexing requests (assuming the Use Async for Full Crawls setting is turned on in the configuration)
  • For deletes, only the doc.hierarchy.item[@id] or doc.id are required

doc.datasourceId (optional) 

Overrides the data source id set in the configuration

doc.identitySourceId (optional)

Overrides the identity source id set in the configuration

doc.aclInheritanceType (optional)

Overrides the ACL inheritance type set in the configuration

doc.isContainer (optional, defaults to true)

If true, the document is indexed as a CONTAINER_ITEM; if false, it is indexed as a CONTENT_ITEM

doc.metadata

The document's Google Cloud Search metadata

    • doc.metadata.objectType (required) - the type of object against which the document is indexed. This is the name of the object definition in a Google Cloud Search schema
    • doc.metadata.title (required) - the document's title
    • doc.metadata.sourceRepositoryUrl (required) - the url at which the document can be found in the repository (e.g. fetchUrl or displayUrl)
    • doc.metadata.createTime (optional) - the time the document was created; it needs to have a value of java.time.Instant type
    • doc.metadata.updateTime (optional) - the time the document was last updated; it needs to have a value of java.time.Instant type
    • doc.metadata.mimeType (optional) - the mime type for the document
    • doc.metadata.contentLanguage (optional) - 2 letter code for the language; if not supplied, Google Cloud Search will do language detection on the content
    • doc.metadata.keywords (optional) - the keywords associated with the document, needs to be a List type
    • doc.metadata.searchQualityScore (optional) - influences the relevance of the document; needs to be a Double with a value between 0 and 1
    • doc.metadata.hash (optional) - a document's hash, useful if the publisher will do push operations in the future
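
The metadata constraints listed above lend themselves to a pre-publish check in the transform stage. The sketch below is hypothetical and uses a plain Map in place of an Aspire object; the class and method names are illustrative, and treating the three required fields as non-empty strings is an assumption.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Aspire: checks a metadata map against the documented rules. */
class MetadataCheckSketch {

    /** Returns a list of problems; an empty list means the metadata passes these checks. */
    static List<String> problems(Map<String, Object> metadata) {
        List<String> problems = new ArrayList<>();
        // Required fields (assumed here to be non-empty strings).
        for (String required : new String[] {"objectType", "title", "sourceRepositoryUrl"}) {
            Object value = metadata.get(required);
            if (!(value instanceof String) || ((String) value).trim().isEmpty()) {
                problems.add("missing " + required);
            }
        }
        // Optional time fields must be java.time.Instant when present.
        for (String timeField : new String[] {"createTime", "updateTime"}) {
            Object value = metadata.get(timeField);
            if (value != null && !(value instanceof Instant)) {
                problems.add(timeField + " must be a java.time.Instant");
            }
        }
        // Keywords must be a List when present.
        Object keywords = metadata.get("keywords");
        if (keywords != null && !(keywords instanceof List)) {
            problems.add("keywords must be a List");
        }
        // The quality score must be a Double between 0 and 1 when present.
        Object score = metadata.get("searchQualityScore");
        if (score != null) {
            boolean valid = score instanceof Double && (Double) score >= 0.0 && (Double) score <= 1.0;
            if (!valid) problems.add("searchQualityScore must be a Double between 0 and 1");
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("objectType", "doc");
        metadata.put("title", "Chocolat");
        metadata.put("searchQualityScore", 1.5);
        // Reports the missing sourceRepositoryUrl and the out-of-range score.
        System.out.println(problems(metadata));
    }
}
```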


Example Output Object

Example of a serialized Aspire Object that would work as an input to the Google Cloud Search publisher:

{
   "doc":{
      "id":"Chocolat",
      "content":"Awesome movie",
      "metadata":{
         "sourceRepositoryUrl":"https:\/\/www.imdb.com\/title\/tt0241303\/",
         "title":"Chocolat",
         "objectType":"doc",
         "contentLanguage":"en",
         "mimeType":"text\/plain",
         "keywords":[
            "keyword1",
            "keyword2"
         ],
         "hash":"asdfsdagwerew",
         "searchQualityScore":0.1,
         "createTime":"2018-08-29T05:42:23.226Z",
         "updateTime":"2018-08-29T05:42:23.226Z"
      },
      "structuredData":{
         "actorName":[
            "Johnny Depp",
            "Juliette Binoche"
         ],
         "movieTitle":"Chocolat",
         "mpaaRating":"G",
         "duration":5,
         "created":"Wed Aug 29 06:42:23 IST 2018",
         "releaseDate":"2018-08-29T05:42:23.226Z",
         "inTheaters":true,
         "yearsSinceRelease":2.1
      },
      "enumsData":{
         "genre":"Drama"
      },
      "acls":{
         "acl":[
            {
               "@access":"allow",
               "@entity":"group",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#employees@cagsearchdemo.com"
            },
            {
               "@access":"allow",
               "@entity":"group",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#Actors"
            },
            {
               "@access":"allow",
               "@entity":"user",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#thanks@cagsearchdemo.com"
            },
            {
               "@access":"allow",
               "@entity":"user",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#ralfaro"
            },
            {
               "@access":"deny",
               "@entity":"user",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#jroberts"
            },
            {
               "@access":"deny",
               "@entity":"user",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#jdepp",
               "@identitySourceId":"adde86161b95e302ca1bc4cb18d45d61"
            },
            {
               "@access":"deny",
               "@entity":"group",
               "@principalType":"CloudIdentity",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#actors@cagsearchdemo.com"
            }
         ],
         "parentAcl":[
            {
               "@access":"allow",
               "@entity":"group",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#confluence-administrators"
            },
            {
               "@access":"allow",
               "@entity":"user",
               "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#admin"
            }
         ]
      },
      "hierarchy":{
         "item":{
            "@id":"childid2",
            "@level":"2",
            "@name":"Aspire",
            "ancestors":{
               "ancestor":{
                  "@id":"parentid2",
                  "@level":"1",
                  "@name":"Aspire",
                  "@url":"https:\/\/www.imdb.com\/title\/parent\/tt0241303\/"
               }
            }
         }
      },
      "isContainer":false,
      "action":"add"
   }
} 
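
The parent id rule described for doc.hierarchy (the ancestor whose level is one less than the item's level) can be illustrated with the hierarchy values from the example above. This is a sketch only: the Node class and parentId method are hypothetical stand-ins for reading the hierarchy out of an Aspire object.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper, not part of Aspire: picks the direct parent from a list of ancestors. */
class HierarchySketch {

    /** Minimal stand-in for a hierarchy item or ancestor (its @id and @level attributes). */
    static final class Node {
        final String id;
        final int level;

        Node(String id, int level) {
            this.id = id;
            this.level = level;
        }
    }

    /** Returns the id of the ancestor whose level is exactly one less than the item's level. */
    static Optional<String> parentId(Node item, List<Node> ancestors) {
        for (Node ancestor : ancestors) {
            if (ancestor.level == item.level - 1) {
                return Optional.of(ancestor.id);
            }
        }
        return Optional.empty(); // no direct parent, e.g. a top-level item
    }

    public static void main(String[] args) {
        Node item = new Node("childid2", 2);
        List<Node> ancestors = Arrays.asList(new Node("parentid2", 1));
        System.out.println(parentId(item, ancestors)); // Optional[parentid2]
    }
}
```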

Step 4. Add a GCS Publisher to the Workflow

To add a Publish to Google Cloud Search rule, drag it from the Workflow Library and drop it in the Workflow Tree where you want to add it.

This will automatically open the Publish to Google Cloud Search window for the configuration of the publisher.

Configuration Settings

Feature | Configuration Information

1. Credentials Key File

Path to the file that contains the credentials needed to connect to Google Cloud Search. Follow Google's instructions to create a credentials file.
2. GCS Data Source Id

The id for the Google Cloud Search data source that will hold the indexed documents. If the incoming Aspire Object specifies a Data Source Id (datasourceId), this value will override the configuration value.

3. GCS API Root Url

The root of the API Url - e.g. https://cloudsearch.googleapis.com

4. GCS Identity Source Id

If there is any content that requires security trimming, this config specifies the id of the Google Cloud Identity Source that stores the users and/or groups. If the Aspire Object or the Aspire Object ACL specifies an identity source id (identitySourceId), this id will override the identity source specified in the configuration.

5. Use Async for Full Crawls

Google Cloud Search allows indexing requests to be sent in either SYNCHRONOUS or ASYNCHRONOUS mode. If this is turned on, indexing requests for full crawls will be sent in ASYNCHRONOUS mode. Otherwise all requests will be SYNCHRONOUS.

6. Document Hierarchy

Google Cloud Search supports containment relationships for documents stored in a data source. If document hierarchy is turned on, items indexed into Google Cloud Search will be linked to their parent containers.

7. ACL Inheritance Type

Google Cloud Search supports ACL inheritance for documents stored in a data source. If the incoming Aspire Object specifies the ACL inheritance type (aclInheritanceType) this property will override the publisher's configuration. Available options supported by the Google Cloud Search Publisher:

  • Not Applicable (NOT_APPLICABLE)
  • Child Override (CHILD_OVERRIDE)
  • Parent Override (PARENT_OVERRIDE)
  • Both Permit (BOTH_PERMIT)
8. Index Type

The type and version of the API or libraries used to index data into Google Cloud Search:

  • Client Library - v1 - V1 of the lower level Google Cloud Search APIs, Java library generated in an automated fashion from the exposed HTTP REST API. Batching and retry logic is implemented in the Aspire publisher code.
  • Indexing SDK - v1 (Not implemented yet) - V1 of a higher level indexing SDK which provides additional functionality on top of the Client Library: retry logic, client side quota management, batch indexing. This is the same SDK that the Google Cloud Search connector framework uses.
9. ACL email entities as G-Suite Principals 

If the ACL entity name is an email address, the entity is set as a GSuitePrincipal in the Google Cloud Search ACL. This applies to all docs/acls going through the publisher. The setting can be overridden for a particular document if the principalType attribute is set to CloudIdentity or GSuitePrincipal in the acl object.

10. Stream vs Batch

  • Stream - each record is sent by itself to Google Cloud Search in the publisher's process method.
  • Batch - all the records created in a batch by a connector's scanner are sent together as a batch to Google Cloud Search in the publisher's endBatch method.

Google Cloud Search Item Upload and Media Upload requests can only be sent one by one (there is no batch option). The publisher creates these requests when the content for an item is larger than 100Kb, because indexing requests cannot have the content sent inline above that size.

11. Content Format

  • Text - the text must be extracted before it gets to the publisher.
  • Raw - the content comes in its raw format (as an input stream or byte array).

The content format can be overridden at the Aspire object level by setting the doc.contentFormat field to TEXT or RAW.

12. Connection timeout

The amount of time to wait (in seconds) before timing out the creation of a Google Cloud Search connection.

13. Read timeout

The amount of time to wait (in seconds) before timing out a read/write operation.

14. Retries

The number of times to retry a failed indexing request. Currently only failures due to quota limit errors (http response code 429) are retried. The retries are done in an exponential back off fashion.

15. Retry delay

The amount of time (in seconds) to wait until the first retry.

16. Maximum retry delay

The maximum amount of time (in minutes) to wait across all retries.

17. Retry delay multiplier

The multiplier used to increase the retry delay from the prior retry iteration.

18. Debug

When turned off, all exception messages/errors will be written to the logs, including info about retry attempts and details about the content of the records that failed. Records that belong to the same Google Cloud Search batch submission will be tagged with a unique identifier, so they can easily be grouped/found together. When the Debug flag is on, summary info about every record that is successfully published will be printed as well, in addition to the info and other general debug messages.
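
To see how the four retry settings (Retries, Retry delay, Maximum retry delay, Retry delay multiplier) interact, the sketch below computes a back off schedule from them. It is an illustration, not the publisher's code: the class and method names are hypothetical, and stopping as soon as the next wait would exceed the maximum total delay is an assumption about how the cap is applied.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Aspire: derives a retry schedule from the four settings. */
class RetryScheduleSketch {

    /**
     * Returns the wait (in seconds) before each retry.
     * The first wait is retryDelaySeconds; each later wait is the previous one
     * times the multiplier. Assumption: retries stop once the next wait would
     * push the cumulative wait past maxTotalDelayMinutes.
     */
    static List<Double> delaysSeconds(int retries, double retryDelaySeconds,
                                      double multiplier, double maxTotalDelayMinutes) {
        List<Double> delays = new ArrayList<>();
        double budgetSeconds = maxTotalDelayMinutes * 60.0;
        double next = retryDelaySeconds;
        double total = 0.0;
        for (int attempt = 0; attempt < retries; attempt++) {
            if (total + next > budgetSeconds) {
                break; // the cap across all retries would be exceeded
            }
            delays.add(next);
            total += next;
            next *= multiplier;
        }
        return delays;
    }

    public static void main(String[] args) {
        // 4 retries, 5 s initial delay, multiplier 2, at most 1 minute in total:
        // the fourth wait (40 s) would bring the total to 75 s, so it is dropped.
        System.out.println(delaysSeconds(4, 5, 2, 1)); // [5.0, 10.0, 20.0]
    }
}
```

With these example values the publisher would give up after three retries even though four are configured, which is worth keeping in mind when choosing the maximum retry delay.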