The Aspire Google Cloud Search Publisher is made up of two main sets of components: the publisher configuration components and the transformer components.
Complete the following steps in order to configure the Google Cloud Search Publisher.
Step 2. Add a new content source
For this step, please follow the Configuration Tutorial of the connector of your choice. Please refer to the Connector list.
Step 3. Add a transform stage to the workflow
A transform stage will need to create Aspire objects that have the structure described below. These Aspire objects will be used as input to the GCS publisher. There are two mechanisms for creating a transform stage:
- Create a custom Groovy script
- Create a custom Java component
Output Aspire Object from the Transform / Input Aspire Object to the GCS Publisher
The Google Cloud Search Publisher expects an incoming Aspire Object that has these fields:
The unique id for the document
|If true, the ACLs on the item will be set to viewable by the entire G-Suite domain|
doc.hierarchy (Required if using ACL Inheritance or Document Hierarchy, otherwise optional)
|Supplies the following values:|
The content of the document in text format.
The fields that make up the Google Cloud Search document's structured data as specified in the Google Cloud Search schema.
The following property types are supported. The conversion to the appropriate Google Cloud Search property type is automatically performed if the value meets the specified criteria:
Please note that all properties except Boolean ones can be multi-valued. If an Aspire property in doc.structuredData holds a value of any type other than those specified above, the property will not be added to the document in the index.
Holds properties that will be loaded in the Google Cloud Search structured metadata as enum fields.
TEXT or RAW; overrides the config in the publisher.
|doc.content (optional)||Holds the text to index when the content format, set in the Aspire object or in the publisher configuration, is TEXT.|
|job.variable['contentStream'] (optional)||Holds the input stream to be used for sending the content in RAW format.|
|job.variable['contentBytes'] (optional)||Holds the bytes array to be used for sending the content in RAW format. The contentStream takes precedence over the contentBytes if both are set.|
Holds the document's ACL information in the typical format produced by Aspire connectors.
For more info on the structure of an ACL in Google Cloud Search, please visit this link.
Add, update or delete.
Overrides the data source id set in the configuration
|doc.identitySourceId (optional)||Overrides the identity source id set in the configuration|
|doc.aclInheritanceType (optional)||Overrides the ACL inheritance type set in the configuration|
|doc.isContainer (optional, defaults to true)||If true, the document is indexed as a CONTAINER_ITEM; if false, it is indexed as a CONTENT_ITEM.|
The document's Google Cloud Search metadata
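The content-related fields above follow a fixed precedence: doc.contentFormat, when present, overrides the format configured in the publisher; in TEXT mode the text comes from doc.content; in RAW mode job.variable['contentStream'] wins over job.variable['contentBytes'] when both are set. The Python sketch below only illustrates that precedence and is not the publisher's own code. The function name `select_content` and the plain dictionaries standing in for the Aspire object and the job variables are hypothetical.

```python
def select_content(doc, job_variables, configured_format):
    """Return (format, source_name, payload) using the precedence described above.

    `doc` and `job_variables` are plain dicts standing in for the Aspire
    object and the job variables (an assumption made for illustration).
    """
    # doc.contentFormat, when set, overrides the publisher configuration.
    content_format = (doc.get("contentFormat") or configured_format).upper()

    if content_format == "TEXT":
        # TEXT mode: the already-extracted text is read from doc.content.
        return content_format, "doc.content", doc.get("content")

    if content_format == "RAW":
        # RAW mode: contentStream takes precedence over contentBytes.
        if job_variables.get("contentStream") is not None:
            return content_format, "contentStream", job_variables["contentStream"]
        return content_format, "contentBytes", job_variables.get("contentBytes")

    raise ValueError(f"Unsupported content format: {content_format}")
```

For instance, a document that sets contentFormat to RAW while the publisher is configured for TEXT would be sent from its stream or byte array, not from doc.content.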
Example Output Object
Example of a serialized Aspire Object that would work as an input to the Google Cloud Search publisher:
Step 4. Add a GCS Publisher to the Workflow
To add a Publish to Google Cloud Search rule, drag it from the Workflow Library and drop it onto the Workflow Tree where you want to add it.
This will automatically open the Publish to Google Cloud Search window for the configuration of the publisher.
1. Credentials Key File
|Path to the file that contains the credentials needed to connect to Google Cloud Search. Follow Google's instructions to create a credentials file.|
|2. GCS Data Source Id||The id for the Google Cloud Search data source that will hold the indexed documents. If the incoming Aspire Object specifies a Data Source Id (datasourceId), that value will override the configuration value.|
|3. GCS API Root Url||The root of the API Url - e.g. https://cloudsearch.googleapis.com|
|4. GCS Identity Source Id||If there is any content that requires security trimming, this config specifies the id of the Google Cloud Identity Source that stores the users and/or groups. If the Aspire Object or the Aspire Object ACL specifies an identity source id (identitySourceId), this id will override the identity source specified in the configuration.|
|5. Use Async for Full Crawls||Google Cloud Search allows indexing requests to be sent in either SYNCHRONOUS or ASYNCHRONOUS mode. If this is turned on, indexing requests for full crawls will be sent in ASYNCHRONOUS mode. Otherwise, all requests will be SYNCHRONOUS.|
|6. Document Hierarchy||Google Cloud Search supports containment relationships for documents stored in a data source. If document hierarchy is turned on, items indexed into Google Cloud Search will be linked to their parent containers.|
|7. ACL Inheritance Type|
Google Cloud Search supports ACL inheritance for documents stored in a data source. If the incoming Aspire Object specifies the ACL inheritance type (aclInheritanceType), this property will override the publisher's configuration. Available options supported by the Google Cloud Search Publisher:
|8. Index Type|
The type and version of the API or libraries used to index data into Google Cloud Search:
|9. ACL email entities as G-Suite Principals|
If the ACL entity name is an email address, the entity is set as a GSuitePrincipal in the Google Cloud Search ACL. This applies to all documents and ACLs going through the publisher. The setting can be overridden for a particular document if the principalType attribute is set to CloudIdentity or GSuitePrincipal in the acl object.
|10. Stream vs Batch||Stream - each record is sent by itself to Google Cloud Search in the publisher's process method. Batch - all the records created in a batch by a connector's scanner are sent together as a batch to Google Cloud Search in the publisher's endBatch method. Google Cloud Search Item Upload and Media Upload requests can only be sent one by one (there is no batch option). The publisher creates these requests when the content for an item is larger than 100 KB, because indexing requests cannot carry the content inline above that size.|
|11. Content Format||Text - the text must be extracted before it gets to the publisher. Raw - the content comes in its raw format (as an input stream or byte array). The content format can be overridden at the Aspire object level by setting the doc.contentFormat field to TEXT or RAW.|
|12. Connection timeout||The amount of time to wait (in seconds) before timing out the creation of a Google Cloud Search connection.|
|13. Read timeout||The amount of time to wait (in seconds) before timing out a read/write operation.|
|14. Retries||The number of times to retry a failed indexing request. Currently only failures due to quota limit errors (HTTP response code 429) are retried. Retries use exponential back-off.|
|15. Retry delay||Amount of time (in seconds) to wait until the first retry.|
|16. Maximum retry delay||The maximum amount of time (in minutes) to wait across all retries.|
|17. Retry delay multiplier||The multiplier used to increase the retry delay from the prior retry iteration.|
|18. Debug||When turned off, all exception messages and errors are still written to the logs, including info about retry attempts and details about the content of the records that failed. Records that belong to the same Google Cloud Search batch submission are tagged with a unique identifier so they can easily be grouped and found together. When the Debug flag is on, summary info about every successfully published record is printed as well, in addition to other general debug messages.|
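The retry settings above (Retries, Retry delay, Maximum retry delay, Retry delay multiplier) combine into an exponential back-off schedule. The Python sketch below shows one plausible reading of how they interact. It is an illustration under stated assumptions, not the publisher's implementation: the function names are hypothetical, and treating the maximum retry delay as a cumulative budget that stops further retries is an assumption about how the cap is applied.

```python
def should_retry(status_code):
    """Only quota limit errors (HTTP 429) are retried, as stated above."""
    return status_code == 429


def retry_delays(retries, retry_delay_s, multiplier, max_total_delay_min):
    """Return the waits (in seconds) before each retry attempt.

    Assumption: the maximum retry delay is a budget across all retries,
    so scheduling stops once the cumulative wait would exceed it.
    """
    budget_s = max_total_delay_min * 60.0  # configured in minutes
    waited = 0.0
    delay = float(retry_delay_s)           # wait before the first retry
    delays = []
    for _ in range(retries):
        if waited + delay > budget_s:
            break
        delays.append(delay)
        waited += delay
        delay *= multiplier                # grow the delay for the next retry
    return delays
```

With 4 retries, a 2 second retry delay, a multiplier of 2 and a 1 minute maximum, `retry_delays(4, 2, 2, 1)` returns `[2.0, 4.0, 8.0, 16.0]`, for a total wait of 30 seconds.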