The Aspire Google Cloud Search Publisher is made up of two main sets of components: the publisher configuration components and the transformer components.
Complete the following steps in order to configure the Google Cloud Search Publisher.
For this step, please follow the Configuration Tutorial of the connector of your choice. Please refer to the Connector list.
A transform stage will need to create Aspire objects that have the structure described below. These Aspire objects will be used as input to the GCS publisher. There are two mechanisms for creating a transform stage:
The Google Cloud Search Publisher expects an incoming Aspire Object that has these fields:
Field | Description |
---|---|
doc.qid (required) | The unique id for the document |
doc.isPublic (optional) | If true, the ACLs on the item will be set to viewable by the entire G-Suite domain |
doc.hierarchy (Required if using ACL Inheritance or Document Hierarchy, otherwise optional) | Supplies the following values:
|
doc.content | The content of the document in text format.
|
doc.structuredData (required) | The fields that make up the Google Cloud Search document's structured data as specified in the Google Cloud Search schema. The following property types are supported. The conversion to the appropriate Google Cloud Search property type is automatically performed if the value meets the specified criteria:
Please note that, with the exception of Boolean properties, all properties can be multi-valued. If an Aspire property in doc.structuredData holds any type of value other than those specified above, the property will not be added to the document in the index. |
doc.enumsData (optional) | Holds properties that will be loaded in the Google Cloud Search structured metadata as enum fields. |
doc.contentFormat (optional) | TEXT or RAW; overrides the config in the publisher. |
doc.content (optional) | Holds the text to send when the content format (set in the Aspire object or in the publisher configuration) is TEXT. |
job.variable['contentStream'] (optional) | Holds the input stream to be used for sending the content in RAW format. |
job.variable['contentBytes'] (optional) | Holds the byte array to be used for sending the content in RAW format. The contentStream takes precedence over the contentBytes if both are set. |
doc.acls (optional) | Holds a document's ACL information in the typical format produced by Aspire connectors. Additional notes:
For more info on the structure of an ACL in Google Cloud Search, please visit this link. |
doc.action (required) | Add, update or delete.
|
doc.datasourceId (optional) | Overrides the data source id set in the configuration |
doc.identitySourceId (optional) | Overrides the identity source id set in the configuration |
doc.aclInheritanceType (optional) | Overrides the ACL inheritance type set in the configuration |
doc.isContainer (optional, defaults to true) | If true, the document is indexed as a CONTAINER_ITEM; if false, it is indexed as a CONTENT_ITEM |
doc.metadata | The document's Google Cloud Search metadata
|
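The override and precedence rules for content in the table above can be sketched as a small function. This is a minimal illustration, not the publisher's actual code: the function name `resolve_content`, its parameters, and the plain dictionaries standing in for the Aspire object and the job variables are all assumptions made for this example. The table does not say what happens when no content source is set, so the sketch simply reports that none was found.

```python
from typing import Any, Mapping, Optional, Tuple


def resolve_content(
    doc: Mapping[str, Any],
    job_variables: Mapping[str, Any],
    configured_format: str,
) -> Tuple[str, Optional[str]]:
    """Return (effective content format, name of the field supplying the content).

    Hypothetical helper: `doc` stands in for the Aspire object's doc fields and
    `job_variables` for the job variables described in the table.
    """
    # doc.contentFormat, when present, overrides the publisher configuration.
    effective = str(doc.get("contentFormat") or configured_format).upper()

    if effective == "TEXT":
        # TEXT content is read from doc.content.
        source = "doc.content" if doc.get("content") is not None else None
        return effective, source

    if effective == "RAW":
        # contentStream takes precedence over contentBytes if both are set.
        if job_variables.get("contentStream") is not None:
            return effective, "job.variable['contentStream']"
        if job_variables.get("contentBytes") is not None:
            return effective, "job.variable['contentBytes']"

    # Assumption: nothing usable was supplied, so no content source is reported.
    return effective, None
```

For example, a document that sets `contentFormat` to RAW and carries both a stream and a byte array resolves to the stream, even if the publisher is configured for TEXT.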
Example of a serialized Aspire Object that would work as an input to the Google Cloud Search publisher:
    {
      "doc": {
        "id": "Chocolat",
        "content": "Awesome movie",
        "metadata": {
          "sourceRepositoryUrl": "https:\/\/www.imdb.com\/title\/tt0241303\/",
          "title": "Chocolat",
          "objectType": "doc",
          "contentLanguage": "en",
          "mimeType": "plain\/text",
          "keywords": [ "keyword1", "keyword2" ],
          "hash": "asdfsdagwerew",
          "searchQualityScore": 0.1,
          "createTime": "2018-08-29T05:42:23.226Z",
          "updateTime": "2018-08-29T05:42:23.226Z"
        },
        "structuredData": {
          "actorName": [ "Johnny Depp", "Juliette Binoche" ],
          "movieTitle": "Chocolat",
          "mpaaRating": "G",
          "duration": 5,
          "created": "Wed Aug 29 06:42:23 IST 2018",
          "releaseDate": "2018-08-29T05:42:23.226Z",
          "inTheaters": true,
          "yearsSinceRelease": 2.1
        },
        "enumsData": { "genre": "Drama" },
        "acls": {
          "acl": [
            { "@access": "allow", "@entity": "group", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#[email protected]" },
            { "@access": "allow", "@entity": "group", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#Actors" },
            { "@access": "allow", "@entity": "user", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#[email protected]" },
            { "@access": "allow", "@entity": "user", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#ralfaro" },
            { "@access": "deny", "@entity": "user", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#jroberts" },
            { "@access": "deny", "@entity": "user", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#jdepp", "@identitySourceId": "adde86161b95e302ca1bc4cb18d45d61" },
            { "@access": "deny", "@entity": "group", "@principalType": "CloudIdentity", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#[email protected]" }
          ],
          "parentAcl": [
            { "@access": "allow", "@entity": "group", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#confluence-administrators" },
            { "@access": "allow", "@entity": "user", "@name": "421aa90e079fa326b6494f812ad13e79#confluence-prefix#admin" }
          ]
        },
        "hierarchy": {
          "item": {
            "@id": "childid2",
            "@level": "2",
            "@name": "Aspire",
            "ancestors": {
              "ancestor": {
                "@id": "parentid2",
                "@level": "1",
                "@name": "Aspire",
                "@url": "https:\/\/www.imdb.com\/title\/parent\/tt0241303\/"
              }
            }
          }
        },
        "isContainer": false,
        "action": "add"
      }
    }
To add a Publish to Google Cloud Search rule, drag it from the Workflow Library and drop it in the Workflow Tree where you want to add it.
This will automatically open the Publish to Google Cloud Search window for the configuration of the publisher.
Feature | Configuration Information |
---|---|
1. Credentials Key File | Path to the file that contains the credentials needed to connect to Google Cloud Search. Follow Google's instructions to create a credentials file. |
2. GCS Data Source Id | The id of the Google Cloud Search data source that will hold the indexed documents. If the incoming Aspire Object specifies a data source id (datasourceId), that value overrides the configuration value. |
3. GCS API Root Url | The root of the API URL, e.g. https://cloudsearch.googleapis.com |
4. GCS Identity Source Id | If any content requires security trimming, this setting specifies the id of the Google Cloud Identity Source that stores the users and/or groups. If the Aspire Object or the Aspire Object ACL specifies an identity source id (identitySourceId), that id overrides the identity source specified in the configuration. |
5. Use Async for Full Crawls | Google Cloud Search allows indexing requests to be sent in either SYNCHRONOUS or ASYNCHRONOUS mode. If this is turned on, indexing requests for full crawls are sent in ASYNCHRONOUS mode. Otherwise, all requests are SYNCHRONOUS. |
6. Populate the GCS Unique Id in Structured Data | If turned on, the publisher populates the item's unique GCS identifier in a gcsUniqueId field in the structured data. The GCS schema needs to include this field as a Text property. |
7. Populate the contentSize in Structured Data | If turned on, the publisher populates the item's content size in a contentSize field in the structured data. The GCS schema needs to include this field as an Integer property. |
8. Populate the internalUrl in Structured Data | If turned on, the publisher populates an internalUrl field in the structured data with the item's qid. The GCS schema needs to include this field as a Text property. |
9. Suffix to include in the GCS Unique Id | This suffix will be added to the item's qid before generating the GCS Unique Id hash/value. The suffix name is typically changed only after the contents of data sources have been wiped out. |
10. Document Hierarchy | Google Cloud Search supports containment relationships for documents stored in a data source. If document hierarchy is turned on, items indexed into Google Cloud Search will be linked to their parent containers. |
11. ACL Inheritance Type | Google Cloud Search supports ACL inheritance for documents stored in a data source. If the incoming Aspire Object specifies the ACL inheritance type (aclInheritanceType) this property will override the publisher's configuration. Available options supported by the Google Cloud Search Publisher:
|
12. Indexer Type | The type and version of the API or libraries used to index data into Google Cloud Search:
|
13. ACL email entities as G-Suite Principals | If the ACL entity name is an email address, set the entity as a GSuitePrincipal in the Google Cloud Search ACL. This applies to all documents and ACLs going through the publisher. The setting can be overridden for a particular document if the principalType attribute is set to CloudIdentity or GSuitePrincipal in the acl object. |
14. Stream vs Batch | Stream - each record is sent by itself to Google Cloud Search in the publisher's process method. Batch - all the records created in a batch by a connector's scanner are sent together as a batch to Google Cloud Search in the publisher's endBatch method. Google Cloud Search Item Upload and Media Upload requests can only be sent one by one (there is no batch option). The publisher creates these requests when the content for an item is larger than 100 KB, because indexing requests cannot carry the content inline above that size. |
15. Content Format | Text - the text must be extracted before it gets to the publisher. Raw - the content comes in its raw format (as an input stream or byte array). The content format can be overridden at the Aspire object level by setting the doc.contentFormat field to TEXT or RAW. |
16. Retry Http Error Codes | Comma-delimited list of HTTP response status codes for which the requests should be retried. |
17. Ignore Http Error Codes | Comma-delimited list of HTTP response status codes which should be ignored and not reported as errors. |
18. Limit Item Content | Imposes a maximum on the size of the raw content that can be sent to GCS, or on the time it takes to read the content. Items with content over this size, or whose content takes longer than the timeout to read (whichever is hit first), will not have their content sent to GCS. Applies only to items sent to GCS as RAW or HTML (i.e. contentFormat). |
19. Limit Extracted Text Size | Imposes a maximum on the size of the extracted text that can be sent to GCS. Items with text over this size will have the text trimmed. Applies only to items sent to GCS as TEXT (i.e. contentFormat). |
20. Log Raw Content Fetch Stats | Log as INFO: the URL of the item for which the content is fetched, the time it took for the fetch and # of transferred bytes. |
21. Connection timeout | The amount of time to wait (in seconds) before timing out the creation of a Google Cloud Search connection. |
22. Read timeout | The amount of time to wait (in seconds) before timing out a read/write operation. |
23. Retries | The number of times to retry a failed indexing request. Currently only failures due to quota limit errors (HTTP response code 429) are retried. Retries use exponential backoff. |
24. Retry delay | Amount of time (in seconds) to wait until the first retry. |
25. Maximum retry delay | The maximum amount of time (in minutes) to wait across all retries. |
26. Retry delay multiplier | The multiplier used to increase the retry delay from the prior retry iteration. |
27. Debug | When turned off, all exception messages and errors are written to the logs, including info about retry attempts and details about the content of the records that failed. Records that belong to the same Google Cloud Search batch submission are tagged with a unique identifier, so they can easily be grouped and found together. When the Debug flag is on, summary info about every record that is successfully published is printed as well, in addition to other general debug messages. |
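The four retry settings in the table (Retries, Retry delay, Maximum retry delay, Retry delay multiplier) combine into an exponential backoff schedule. The sketch below shows one plausible reading of how they interact. It is not the publisher's implementation: the function name `retry_delays`, its parameter names, and in particular the rule that retrying stops once the next wait would push the total past the maximum are assumptions made for illustration.

```python
from typing import List


def retry_delays(
    retries: int,
    retry_delay_s: float,
    max_retry_delay_min: float,
    multiplier: float,
) -> List[float]:
    """Return the wait, in seconds, before each retry attempt.

    Hypothetical helper mirroring the Retries, Retry delay, Maximum retry
    delay and Retry delay multiplier settings.
    """
    # Maximum retry delay is configured in minutes; the other delays in seconds.
    budget_s = max_retry_delay_min * 60.0
    delays: List[float] = []
    delay = float(retry_delay_s)  # wait before the first retry
    waited = 0.0
    for _ in range(retries):
        # Assumption: give up once the total wait would exceed the budget.
        if waited + delay > budget_s:
            break
        delays.append(delay)
        waited += delay
        # Each retry waits `multiplier` times longer than the previous one.
        delay *= multiplier
    return delays
```

With 5 retries, a 1 second retry delay, a 1 minute maximum, and a multiplier of 2, this sketch waits 1, 2, 4, 8, and 16 seconds. Raising the retry count to 10 with the same settings still yields only those five waits, because a sixth wait of 32 seconds would exceed the 60 second total.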