The Aspire Google Cloud Search Publisher is made up of two main sets of components: the publisher configuration components and the transformer components.
Complete the following steps in order to configure the Google Cloud Search Publisher.
The Google Cloud Search Publisher posts data received from a variety of sources, both Aspire Connectors and Aspire Services, to Google Cloud Search.
Make sure to configure the following features:
Feature | Configuration Information |
---|---|
1. Credentials Key File | Path to the file that contains the credentials needed to connect to Google Cloud Search. Follow these instructions to create a credentials file. |
2. GCS Data Source Id | The id of the Google Cloud Search data source that will hold the indexed documents. If the incoming Aspire Object specifies a Data Source Id (datasourceId), this value overrides the configuration value. |
3. GCS API Root Url | The root of the API Url - e.g. https://cloudsearch.googleapis.com |
4. GCS Identity Source Id | If there is any content that requires security trimming, this setting specifies the id of the Google Cloud Identity Source that stores the users and/or groups. If the Aspire Object or the Aspire Object ACL specifies an identity source id (identitySourceId), this id overrides the identity source specified in the configuration. |
5. Use Async for Full Crawls | Google Cloud Search allows indexing requests to be sent in either SYNCHRONOUS or ASYNCHRONOUS mode. If this is turned on, indexing requests for full crawls are sent in ASYNCHRONOUS mode. Otherwise, all requests are SYNCHRONOUS. |
6. Document Hierarchy | Google Cloud Search supports containment relationships for documents stored in a data source. If document hierarchy is turned on, items indexed into Google Cloud Search will be linked to their parent containers. |
7. ACL Inheritance Type | Google Cloud Search supports ACL inheritance for documents stored in a data source. If the incoming Aspire Object specifies the ACL inheritance type (aclInheritanceType), this property overrides the publisher's configuration. Available options supported by the Google Cloud Search Publisher (the meaning of the values is explained here):
|
8. Index Type | The type and version of the API or libraries used to index data into Google Cloud Search:
|
9. ACL email entities as G-Suite Principals | If the ACL entity name is an email address, set the entity as a GSuitePrincipal in the Google Cloud Search ACL. This applies to all documents and ACLs going through the publisher. The setting can be overridden for a particular document if the principalType attribute is set to CloudIdentity or GSuitePrincipal in the acl object. |
10. Stream vs Batch | Stream - each record is sent by itself to Google Cloud Search in the publisher's process method. Batch - all the records created in a batch by a connector's scanner are sent together as a batch to Google Cloud Search in the publisher's endBatch method. Google Cloud Search Item Upload and Media Upload requests can only be sent one by one (there is no batch option); the publisher creates these requests when the content for an item is larger than 100 KB, because indexing requests cannot carry content inline above that size. |
11. Content Format | Text - the text must be extracted before it gets to the publisher. Raw - the content comes in its raw format (as an input stream or byte array). The content format can be overridden at the Aspire Object level by setting the doc.contentFormat field to TEXT or RAW. |
12. Connection timeout | The amount of time to wait (in seconds) before timing out the creation of a Google Cloud Search connection. |
13. Read timeout | The amount of time to wait (in seconds) before timing out a read/write operation. |
14. Retries | The number of times to retry a failed indexing request. Currently only failures due to quota limit errors (HTTP response code 429) are retried. Retries use exponential backoff. |
15. Retry delay | Amount of time (in seconds) to wait until the first retry. |
16. Maximum retry delay | The maximum amount of time (in minutes) to wait across all retries. |
17. Retry delay multiplier | The multiplier used to increase the retry delay from the prior retry iteration. |
18. Debug | When turned off, all exception messages/errors are still written to the logs, including info about retry attempts and details about the content of the records that failed. Records that belong to the same Google Cloud Search batch submission are tagged with a unique identifier, so they can easily be grouped and found together. When the Debug flag is on, summary info about every record that is successfully published is printed as well, in addition to other general debug messages. |
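The four retry settings above (Retries, Retry delay, Maximum retry delay, Retry delay multiplier) interact, so a small sketch may help when choosing values. This is an illustration only, not the publisher's implementation: the function names are hypothetical, and it assumes that each wait is the previous one times the multiplier and that retrying stops once the cumulative wait would exceed the maximum retry delay.

```python
def retry_delays(retries, retry_delay_s, multiplier, max_total_delay_min):
    """Return the wait, in seconds, before each retry attempt.

    Assumptions (not confirmed by the publisher documentation): the delay
    is multiplied by `multiplier` after every attempt, and no further retry
    is scheduled once the summed waits would exceed the maximum retry delay.
    """
    budget_s = max_total_delay_min * 60  # the maximum is configured in minutes
    delays = []
    waited = 0.0
    delay = float(retry_delay_s)
    for _ in range(retries):
        if waited + delay > budget_s:
            break  # the next wait would exceed the overall budget
        delays.append(delay)
        waited += delay
        delay *= multiplier
    return delays


def should_retry(http_status):
    """Only quota-limit failures (HTTP 429) are retried, per the Retries setting."""
    return http_status == 429


if __name__ == "__main__":
    print(retry_delays(retries=5, retry_delay_s=2, multiplier=2.0, max_total_delay_min=1))
```

With 5 retries, a 2-second retry delay, a multiplier of 2 and a 1-minute maximum, the sketch yields waits of 2, 4, 8 and 16 seconds; the fifth retry is dropped because its 32-second wait would push the total past 60 seconds.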
The Google Cloud Search Publisher expects an incoming Aspire Object that has these fields:
Field | Description |
---|---|
doc.qid (mandatory) | The unique id for the document |
doc.isPublic (optional) | If true, the ACLs on the item will be set to viewable by the entire G-Suite domain |
doc.hierarchy (Required if using ACL Inheritance or Document Hierarchy, otherwise optional) | Supplies the following values:
|
doc.content | The content of the document in text format.
|
doc.structuredData (required) | The fields that make up the Google Cloud Search document's structured data as specified in the Google Cloud Search schema. The following property types are supported. The conversion to the appropriate Google Cloud Search property type is performed automatically if the value meets the specified criteria:
Please note that, with the exception of Boolean properties, all properties can be multi-valued. If an Aspire property in doc.structuredData holds any type of value other than those specified above, the property will not be added to the document in the index. |
doc.enumsData (optional) | Holds properties that will be loaded in the Google Cloud Search structured metadata as enum fields. |
doc.contentFormat (optional) | TEXT or RAW; overrides the config in the publisher. |
doc.content (optional) | Holds the text if the contentFormat set in the aspire object or the publisher config is set to TEXT. |
job.variable['contentStream'] (optional) | Holds the input stream to be used for sending the content in RAW format. |
job.variable['contentBytes'] (optional) | Holds the bytes array to be used for sending the content in RAW format. The contentStream takes precedence over the contentBytes if both are set. |
doc.acls (optional) | Holds a document's ACL information in the typical format produced by Aspire connectors. Additional notes:
For more info on the structure of an ACL in Google Cloud Search, please visit this link. |
doc.action (required) | Add, update or delete.
|
doc.datasourceId (optional) | Overrides the data source id set in the configuration |
doc.identitySourceId (optional) | Overrides the identity source id set in the configuration |
doc.aclInheritanceType (optional) | Overrides the ACL inheritance type set in the configuration |
doc.isContainer (optional, defaults to true) | If true, the document is indexed as a CONTAINER_ITEM; if false, it is indexed as a CONTENT_ITEM |
doc.metadata | The document's Google Cloud Search metadata
|
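Several fields in the table above override publisher settings, and the RAW content source has its own precedence rule. The sketch below shows one way to read those rules together. It is an assumption-laden illustration: plain dicts stand in for the Aspire Object, the job variables and the publisher configuration, and the function name and the configuration key names are hypothetical rather than part of the Aspire API.

```python
def effective_settings(doc, job_variables, config):
    """Combine publisher configuration with per-document overrides.

    `doc` mirrors the doc.* fields, `job_variables` mirrors job.variable[...],
    and `config` holds publisher settings under hypothetical key names.
    """
    def pick(key):
        # A value set on the document wins over the publisher configuration.
        value = doc.get(key)
        return value if value is not None else config.get(key)

    content_format = pick("contentFormat")
    if content_format == "RAW":
        # contentStream takes precedence over contentBytes if both are set.
        content = job_variables.get("contentStream")
        if content is None:
            content = job_variables.get("contentBytes")
    else:
        # TEXT: the already extracted text is read from doc.content.
        content = doc.get("content")

    return {
        "datasourceId": pick("datasourceId"),
        "identitySourceId": pick("identitySourceId"),
        "aclInheritanceType": pick("aclInheritanceType"),
        "contentFormat": content_format,
        "content": content,
    }


if __name__ == "__main__":
    config = {"datasourceId": "cfg-ds", "identitySourceId": "cfg-ids",
              "contentFormat": "TEXT"}
    doc = {"datasourceId": "doc-ds", "content": "Awesome movie"}
    print(effective_settings(doc, {}, config))
```

Keeping the resolution in one place makes it easy to log which data source, identity source and content format a given document actually used.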
Example of a serialized Aspire Object that would work as an input to the Google Cloud Search publisher:
{
  "doc":{
    "id":"Chocolat",
    "content":"Awesome movie",
    "metadata":{
      "sourceRepositoryUrl":"https:\/\/www.imdb.com\/title\/tt0241303\/",
      "title":"Chocolat",
      "objectType":"doc",
      "contentLanguage":"en",
      "mimeType":"plain\/text",
      "keywords":[ "keyword1", "keyword2" ],
      "hash":"asdfsdagwerew",
      "searchQualityScore":0.1,
      "createTime":"2018-08-29T05:42:23.226Z",
      "updateTime":"2018-08-29T05:42:23.226Z"
    },
    "structuredData":{
      "actorName":[ "Johnny Depp", "Juliette Binoche" ],
      "movieTitle":"Chocolat",
      "mpaaRating":"G",
      "duration":5,
      "created":"Wed Aug 29 06:42:23 IST 2018",
      "releaseDate":"2018-08-29T05:42:23.226Z",
      "inTheaters":true,
      "yearsSinceRelease":2.1
    },
    "enumsData":{ "genre":"Drama" },
    "acls":{
      "acl":[
        { "@access":"allow", "@entity":"group", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#[email protected]" },
        { "@access":"allow", "@entity":"group", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#Actors" },
        { "@access":"allow", "@entity":"user", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#[email protected]" },
        { "@access":"allow", "@entity":"user", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#ralfaro" },
        { "@access":"deny", "@entity":"user", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#jroberts" },
        { "@access":"deny", "@entity":"user", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#jdepp", "@identitySourceId":"adde86161b95e302ca1bc4cb18d45d61" },
        { "@access":"deny", "@entity":"group", "@principalType":"CloudIdentity", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#[email protected]" }
      ],
      "parentAcl":[
        { "@access":"allow", "@entity":"group", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#confluence-administrators" },
        { "@access":"allow", "@entity":"user", "@name":"421aa90e079fa326b6494f812ad13e79#confluence-prefix#admin" }
      ]
    },
    "hierarchy":{
      "item":{
        "@id":"childid2",
        "@level":"2",
        "@name":"Aspire",
        "ancestors":{
          "ancestor":{
            "@id":"parentid2",
            "@level":"1",
            "@name":"Aspire",
            "@url":"https:\/\/www.imdb.com\/title\/parent\/tt0241303\/"
          }
        }
      }
    },
    "isContainer":false,
    "action":"add"
  }
}
See the TestIndexGcsPublisher unit test for an example of how to index/delete documents from a Google Cloud Search index.
The associated Google Cloud Search schema can be found here.
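The field table earlier in this section marks some fields as mandatory or required, and a quick check before an object reaches the publisher can surface gaps early. The following sketch is hypothetical and not part of Aspire: it treats the contents of `doc` as a plain dict, takes the field names from the table (qid, structuredData, action, hierarchy), and assumes the action value is compared case-insensitively.

```python
VALID_ACTIONS = {"add", "update", "delete"}


def missing_fields(doc, uses_hierarchy=False):
    """Return the names of required doc.* fields that are absent or invalid.

    `doc` is a plain dict standing in for the Aspire Object's doc element.
    Set `uses_hierarchy` when ACL Inheritance or Document Hierarchy is in
    use, since doc.hierarchy is only required in that case.
    """
    problems = []
    if not doc.get("qid"):
        problems.append("qid")
    if "structuredData" not in doc:
        problems.append("structuredData")
    # Assumption: action values are matched without regard to case.
    if str(doc.get("action", "")).lower() not in VALID_ACTIONS:
        problems.append("action")
    if uses_hierarchy and "hierarchy" not in doc:
        problems.append("hierarchy")
    return problems


if __name__ == "__main__":
    sample = {"qid": "Chocolat", "structuredData": {"movieTitle": "Chocolat"},
              "action": "add"}
    print(missing_fields(sample, uses_hierarchy=True))
```

An empty result means every field the table marks as required is present; otherwise the returned names show what to fix upstream, for example in a transformer stage.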
Google Cloud Search Transformers are Aspire components that transform incoming Aspire Objects into output Aspire Objects that comply with the requirements for processing by the Google Cloud Search Publisher.
The Google Cloud Search Transformers are optional components, provided as an example of how to create other custom transformers specific to a client's project needs.
The transformers can be added to a workflow, before a Google Cloud Search publisher, as an Aspire application with the Maven coordinates com.searchtechnologies.aspire:app-gcs-transforms.
The following transformers are implemented: