Step 2. Add a new Content Source

For this step, follow the corresponding step in the Configuration Tutorial of the connector of your choice. To find it, refer to the Connector list.

Step 3a. Specify Publisher Information

In the publisher window, specify the connection information used to publish to Microsoft Search.

Name: Unique name for the publisher
Tenant Id: The tenant Id provided by Microsoft
Client Id: The client Id generated when the Application was registered in Prerequisites
Client Secret: The client secret that was generated in Prerequisites
Index Name: The name of the connection/index that will be created in MS Search
Groovy Transform: The path to a Groovy transformation file that will process the document to make it match the expected structure from either the fixed ExternalFile or the custom ExternalItem data types in MS Search connection schemas
Use custom schema: Enables the limited custom schema in MS Search. If disabled, the publisher assumes the ExternalFile schema. If enabled, it uses the ExternalItem schema and requires the path to a JSON schema file
Start/end actions Clear: Enable this to have the connection created automatically on full crawls. If disabled, the publisher assumes the connection already exists

Other configuration items are common to every publisher component.

ExternalFile vs ExternalItem

Microsoft Search allows one of two schemas: the fixed ExternalFile or the custom ExternalItem. ExternalItem gives limited freedom to define the properties expected from crawled items. In both cases, the Groovy transformation file must produce output that matches the selected schema.

ExternalFile schema

The fixed external file schema expects the following information:

acl: The list of ACLs for the document
createdDateTime: A standard UTC string that represents the creation date and time
modifiedDateTime: A standard UTC string that represents the last modification date and time
createdBy: The author's name
lastModifiedBy: The name of the last person that modified the document
title: The document's title
url: The document's url
name: The document's name
extension: The document's file name extension
size: The document size
content: The document's content
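
The field list above can be checked before a document is sent. The following is a minimal sketch only, assuming the transformed document is available as a `Map`; the class `ExternalFileCheck` and the method `missingFields` are hypothetical names used for illustration and are not part of the publisher component.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the component: reports which of the
// ExternalFile fields listed above are absent from a document map.
class ExternalFileCheck {
    // Field names copied from the ExternalFile list in this section.
    static final List<String> EXPECTED_FIELDS = Arrays.asList(
        "acl", "createdDateTime", "modifiedDateTime", "createdBy",
        "lastModifiedBy", "title", "url", "name", "extension",
        "size", "content");

    // Returns the expected field names that the document does not contain,
    // in the same order as the list above.
    static List<String> missingFields(Map<String, Object> doc) {
        List<String> missing = new ArrayList<>();
        for (String field : EXPECTED_FIELDS) {
            if (!doc.containsKey(field)) {
                missing.add(field);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "sample document");
        doc.put("url", "http://the.url.com");
        // Prints the nine field names that are still missing.
        System.out.println(missingFields(doc));
    }
}
```

A check like this only confirms that the keys are present. It does not validate value types or ACL contents.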

This is a sample document in JSON format as expected by the REST API:


{
	"@odata.type": "microsoft.graph.externalFile",
	"acl": [
		{
			"type": "user",
			"value": "d411eb08-42e2-4316-aab5-2df8e9d9c21b",
			"accessType": "grant",
			"identitySource": "Azure Active Directory"
		}
	],
	"createdDateTime": "2017-11-08T19:06:17Z",
	"modifiedDateTime": "2017-11-08T19:06:17Z",
	"createdBy": "empty",
	"lastModifiedBy": "empty",
	"title": "sample document",
	"url": "http://the.url.com",
	"name": "name.txt",
	"extension": "txt",
	"size": 10,
	"content": "the content/n"
}
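
The date fields in the sample use a UTC string with whole seconds, such as `2017-11-08T19:06:17Z`. As a hedged sketch, a `java.util.Date` can be rendered in that shape with the standard `java.time` API. The class name `UtcTimestamps` and the choice to truncate to whole seconds are assumptions made to match the sample, not documented requirements of the component.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Date;

// Hypothetical helper, not part of the component.
class UtcTimestamps {
    // Formats a Date as an ISO-8601 UTC string with whole seconds,
    // the same shape as the createdDateTime value in the sample above.
    static String toUtcString(Date date) {
        // Assumption: sub-second precision is dropped to match the sample.
        Instant instant = date.toInstant().truncatedTo(ChronoUnit.SECONDS);
        return DateTimeFormatter.ISO_INSTANT.format(instant);
    }

    public static void main(String[] args) {
        Date sample = Date.from(Instant.parse("2017-11-08T19:06:17Z"));
        System.out.println(toUtcString(sample));
    }
}
```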

ExternalItem schema

As mentioned before, the ExternalItem schema has limited customization capabilities. It expects the following information:

acl: The list of ACLs for the document
properties: A name/value list of custom properties of predefined types
content: The document's content

When configured to use a custom schema, the publisher component expects a JSON schema file with the following structure:


{
	"properties": [
		{
			"name": "propertyName",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		}
	]
}

Where:

name: The property name in JSON friendly characters
type: The data type of the property. Can be one of the following:
- String
- Int64
- Double
- DateTime (as UTC offset: YYYY-MM-DDTHH:mm:ss.nnZ)
- Boolean
- Collection(String)
- Collection(Int64)
- Collection(Double)
- Collection(DateTime)
isSearchable: (optional) If true, the property will be indexed as containing searchable content
isRetrievable: (optional) If true, the property can be retrieved as part of search results
isQueryable: (optional) If true, the property can be queried for specific values
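
Because the `type` value must be one of the names listed above, a schema file can be checked for typos before a crawl. This is a minimal sketch under that assumption; `SchemaTypeCheck` and `isAllowedType` are hypothetical names, and the component itself may validate types differently or not at all.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the component.
class SchemaTypeCheck {
    // Type names copied from the list in this section.
    static final Set<String> ALLOWED_TYPES = new HashSet<>(Arrays.asList(
        "String", "Int64", "Double", "DateTime", "Boolean",
        "Collection(String)", "Collection(Int64)",
        "Collection(Double)", "Collection(DateTime)"));

    // Returns true only for an exact, case-sensitive match with a listed type.
    static boolean isAllowedType(String type) {
        return type != null && ALLOWED_TYPES.contains(type);
    }

    public static void main(String[] args) {
        System.out.println(isAllowedType("Int64"));
        System.out.println(isAllowedType("int64"));
    }
}
```

The comparison is deliberately case-sensitive, on the assumption that the type names must be written exactly as listed.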

This is the default schema file provided with the component (schemaProperties.json):


{
	"properties": [
		{
			"name": "id",
			"type": "String"
		},
		{
			"name": "name",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		},
		{
			"name": "extension",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		},
		{
			"name": "size",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		},
		{
			"name": "createdBy",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		},
		{
			"name": "lastModifiedBy",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		},
		{
			"name": "createdDateTime",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		},
		{
			"name": "modifiedDateTime",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		},
		{
			"name": "title",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true"
		},
		{
			"name": "url",
			"type": "String",
			"isSearchable": "true",
			"isRetrievable": "true",
			"isQueryable": "true"
		}
	]
}

Groovy Transformation file

The Groovy transformation file converts each crawled document into output that is tailored to the client's needs and that can be safely sent to Microsoft Search through the REST API. The output of the transformation must match the expected schema structure.

This is the default Groovy transformation file that is provided with the component (aspireToMicrosoftSearchBulk.groovy):


import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.nio.charset.Charset;

def connectorSpecificMap = [
	'isContainer':'is_container'
]

def getContent(String content) {
	try {
		if(content.getBytes().length > 16777216L) {
			return content.substring(0,10485760) + "...";
		} else {
			return content
		}
	} catch(Throwable t) {
		return "";
	}
}

def getMD5(String id) {
	MessageDigest digest = MessageDigest.getInstance("MD5")
	String md5name = new BigInteger(1, digest.digest(id.getBytes())).toString(16)
	return md5name;
}

// Function that processes the children of a connector-specific field
def getChildren(name, parent) {

	builder."$name"() {
		parent.getChildren().each() { val ->

			def attr = val.getName();

			//if it has other children
			if(val.getChildren().size() > 0) {
				getChildren(attr, val);
			} else {
				builder."$attr"() {
		
					//All the attributes
					val.getAttributeNames().each() { attrName ->
						"@$attrName" val.getAttribute(attrName);
					}

					//Main content
					if (val?.getText() != null) {
						'_$' val?.getText();
					}
				}
			}
		}
	}
}

//***************************************************
//
// Main routine
//

// Action of the job
String action = doc.action.getText();

if ((action == "add") || (action == "update")) {
	/*****************
	* Add or Update *
	*****************/

	builder.$object() {
		'@search.action' "upload"

		// Get ID
		String newId = "";
		if (doc.id != null) {
			newId = getMD5(doc.id.getText())
		} else if (doc.fetchUrl != null) {
			newId = getMD5(doc.fetchUrl.getText())
		} else if (doc.url != null) {
			newId = getMD5(doc.url.getText())
		} else if (doc.displayUrl != null) {
			newId = getMD5(doc.displayUrl.getText())
		} else {
			newId = "ID-NOT-PROVIDED"
		}

		'id' newId

		// name
		// Safe navigation: doc.url may be absent, as the null checks elsewhere assume
		String nameOfTheFile = doc.url?.getText()
		if(nameOfTheFile != null) {
			String[] urlItems = nameOfTheFile.split('/')
			nameOfTheFile = urlItems[urlItems.length - 1]

			name nameOfTheFile

			String[] fileNameItems = nameOfTheFile.split(/\./)
			if(fileNameItems.length > 1) {
				extension fileNameItems[fileNameItems.length - 1]
			} else {
				extension '[empty]'
			}
		}

		// Size
		if(doc.size != null) {
			size doc.size
		}
	
		// createdBy
		if(doc.author != null) {
			createdBy doc.author
		} else {
			createdBy "empty"
		}

		// lastModifiedBy
		if(doc.lastModifiedBy != null) {
			lastModifiedBy doc.lastModifiedBy
		} else {
			lastModifiedBy "empty"
		}

		//createdDateTime
		if(doc.createDate != null) {
			// Emit the same field that was just checked for null
			createdDateTime doc.createDate
		} else if(doc.lastModified != null) {
			createdDateTime doc.lastModified
		} else {
			createdDateTime (new Date())
		}

		//modifiedDateTime
		if(doc.lastModified != null) {
			modifiedDateTime doc.lastModified
		} else {
			modifiedDateTime (new Date())
		}

		// title
		if(doc.title != null) {
			title doc.title
		} else if(nameOfTheFile != null) {
			title nameOfTheFile
		} else {
			title '[empty title]'
		}

		if (doc.displayUrl != null) {
			url doc.displayUrl
		} else if (doc.url != null) {
			url doc.url
		} else if (doc.fetchUrl != null){
			url doc.fetchUrl
		} else {
			url "URL-NOT-PROVIDED"
		}

		// content
		if(doc.content != null) {
			content doc.content?.getText()
		}

		// ACLs 
		if (doc.acls != null) {
			builder.acls() {
				$list { 
					doc.acls.getChildren().each() { val ->
						$object() {
							name val.getAttribute("name")
							access val.getAttribute("access")
							entity val.getAttribute("entity")
						}
					}
				}
			}
		//END
		}

	}
} else {
	/**********
	* Delete *
	**********/

	builder.$object() {
		'@search.action' "delete"

		String delId = "";
		// Get ID
		if (doc.id != null) {
			delId = getMD5(doc.id.getText())
		} else if (doc.fetchUrl != null) {
			delId = getMD5(doc.fetchUrl.getText())
		} else if (doc.url != null) {
			delId = getMD5(doc.url.getText())
		} else if (doc.displayUrl != null) {
			delId = getMD5(doc.displayUrl.getText())
		} else {
			delId = "ID-NOT-PROVIDED"
		}

		'id' delId
	}
}
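
The `getMD5` helper above renders the digest with `BigInteger.toString(16)`, which drops leading zeros, so some IDs come out shorter than 32 characters, and it encodes the input with the platform default charset. Whether either point matters to Microsoft Search is not stated here. If fixed-width IDs are wanted, one possible variant is sketched below in Java; the class name `Md5Id`, the method name `md5Hex`, and the use of UTF-8 are assumptions, and switching to it would change the IDs of any documents whose current ID was shortened.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical variant of getMD5, not the code shipped with the component.
class Md5Id {
    // Returns the MD5 digest of the id as exactly 32 lowercase hex characters.
    static String md5Hex(String id) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            // Assumption: UTF-8 is used so the result does not depend on the platform.
            byte[] hash = digest.digest(id.getBytes(StandardCharsets.UTF_8));
            // %032x pads with leading zeros that toString(16) would drop.
            return String.format("%032x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The digest of "a" starts with a zero, which the unpadded form loses.
        System.out.println(md5Hex("a"));
    }
}
```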

Once you've clicked on the Add button, it will take a moment for Aspire to download all of the necessary components (the Jar files) from the Maven repository and load them into Aspire. Once that's done, the publisher will appear in the Workflow Tree.

Info
For details on using the Workflow section, please refer to Workflow introduction.
