Group Expansion Introduction

Group expansion is most often considered at the time a search query is being executed – a user has logged in to a search application with their user name and wants to retrieve data. The data in the search engine has been secured using access control lists (ACLs) which contain the names of users and groups that are permitted or denied access to each item. To provide the users a comprehensive result set, covering all the documents a user has access to, the search engine must know the groups that the user belongs to. However, at login time, the user only provides their username and hence, group expansion is required.

Different search applications can request group expansion with varying frequencies. Some applications may request expansion when the user logs in, whereas others may request it for every search. The frequency with which requests are made is not particularly important, as long as the servers are not overloaded. What is important is the speed with which the information is returned: Aspire should be able to respond to a request in a timely manner. Whilst a request made during login may be able to wait a few seconds, a request made during a search must be processed very quickly.

The format of the incoming request is also important. Search applications may expect to submit requests to an LDAP server (the GSA for example) or via HTTP. Group expansion must support both of these request types. There are other things to consider:

An Aspire installation will probably have multiple content repositories (Documentum, SharePoint etc). Whilst these will quite often connect to LDAP or Active Directory and use groups from there, it is quite possible they use groups that are unique to the content repository and so it is important that these “local” groups are correctly handled.

If we do have multiple repositories, it’s possible that the username used in one content source is not the same as the username used in another, so this situation must be accounted for.

It is also possible that user and group information from differing repositories comes in different forms (with or without a domain for example).

Finally, the username passed to group expansion could come with or without a domain, or may not even be a “standard” username (it could be a SID/GUID for example).

The use of caching in group expansion

As mentioned above, it’s important that requests for group expansion are processed as quickly as possible, which suggests that some sort of caching is required. There is a potential issue here: what happens if a cache entry places the user in a group from which he is later removed? It is theoretically possible that a search will return a result that he can no longer access. However, most system implementations will then direct the user to the original repository to retrieve the document, at which point the request will be rejected, so any “risk” of the user seeing prohibited content is mitigated.

Group Expansion Process in Aspire

At a very high level, group expansion inside Aspire can be split into two processes:

Group collection and caching
Processing expansion requests

The first process is the responsibility of each content source or the LDAP Cache service. Each connector knows how to retrieve the users and their groups from the repository, and the connector framework stores them in the Group Expansion Manager database cache.

The second process is executed by the Group Expansion Manager service, which receives the HTTP request and queries its database cache for the username and the groups stored against it.

The high level architecture for group expansion becomes:

Content Source
- Group collection and caching
Group Expansion Manager
- Processing expansion requests

The group expansion manager will also handle the conversion of requests from external sources via HTTP or LDAP to a format Aspire can use, and will then convert the responses back to the appropriate form.

The high level architecture can be seen in the diagram below:

Group collection and caching

Whatever the content source, request format or frequency, the key to the group expansion process is the ability to get the groups for a user. Since, for content source data extraction purposes, we have already built a connector that understands how to connect to the content source and has all the appropriate jar files for the content source API built in, we use this connector to collect the groups and insert them into a cache.

A scheduler component will periodically send the content source connector component a job to tell it to reload the cache. When the connector component receives this job, it will update the group expansion manager database cache with the newly downloaded groups. These groups are downloaded in a separate thread (so as not to block any repository scanning). The connector requests:

a list of all users and the groups to which they belong, and
a list of groups and the groups to which they belong (for nested group expansion)

Once the connector has this information, it will:

Consider each user in turn and perform un-nesting of the groups, to produce a cache of users against the (full set of) groups to which they belong.
Update the cache so that it always reflects the current users and groups: deleted users are removed, existing users have their groups updated, and new users are added.

Un-nesting the groups involves taking the user and looking up the groups to which he belongs, then looking up each of the groups found to see whether they are themselves members of other groups. This process is repeated until no new groups are found. Thus, if a user is a member of group “one” and group “two”, and group “two” is a member of group “three” and group “four”, the entry in the cache will record the user as a member of groups “one”, “two”, “three” and “four”.
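
The un-nesting step can be illustrated with a minimal Python sketch. This is not Aspire's implementation; the function name expand_groups and the dictionary layout (a group name mapped to the groups it is a member of) are assumptions made for illustration only.

```python
from collections import deque


def expand_groups(direct_groups, group_parents):
    """Return every group reachable from the user's direct groups.

    direct_groups: iterable of group names the user belongs to directly.
    group_parents: dict mapping a group name to the groups it is a member of
                   (hypothetical layout, chosen for this sketch).
    """
    seen = set()
    queue = deque(direct_groups)
    while queue:
        group = queue.popleft()
        if group in seen:
            continue  # already expanded; this also guards against membership cycles
        seen.add(group)
        # Queue the groups that this group is itself a member of.
        queue.extend(group_parents.get(group, ()))
    return seen


# The example from the text: the user is in "one" and "two",
# and "two" is a member of "three" and "four".
group_parents = {"two": ["three", "four"]}
user_groups = expand_groups(["one", "two"], group_parents)
print(sorted(user_groups))
```

The loop stops once no new groups are discovered, which matches the "repeated until no new groups are found" rule. The seen set means a circular membership cannot cause an endless loop.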

An added complication when calculating group membership is that some content source repositories can use external groups (typically from an LDAP or Active Directory server). These groups can typically be members of groups “local” to the repository, but they are not always reported back in the list of groups for the repository, so it can be difficult to work out which “local” groups they belong to. Thus the connector can optionally be supplied with a list of groups from an external source. These groups are then looked up by the connector to ensure they are handled correctly.

At the end of the group collection process, the cache will contain a consistent set of users against the groups to which they belong. This cache will then be used to serve group expansion requests until the next time the scheduler determines the cache should be reloaded. At this point the process begins again.
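
The reload behaviour described above (remove deleted users, update existing ones, add new ones) could be sketched as follows. The real cache is a database maintained by the connector framework; here a plain dictionary stands in for it, and the function name sync_cache is hypothetical.

```python
def sync_cache(cache, fresh):
    """Bring `cache` in line with `fresh`.

    Both arguments map a username to the full (un-nested) set of groups.
    A dict is used here purely as a stand-in for the database cache.
    """
    # Remove users that no longer exist in the repository.
    for user in list(cache):
        if user not in fresh:
            del cache[user]
    # Add new users and overwrite the groups of existing ones.
    for user, groups in fresh.items():
        cache[user] = set(groups)
    return cache


cache = {"alice": {"a"}, "bob": {"b"}}
fresh = {"alice": {"a", "c"}, "carol": {"d"}}
sync_cache(cache, fresh)
print(sorted(cache))
```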

Processing expansion requests

At the point at which group expansion needs to be performed for a user, the user will be looked up in the group expansion manager database cache. The input to this process (known as a Group Expansion Request) will be an Aspire job with particular attributes, including the username of the user to be looked up. The response (a Group Expansion Response) will include a set of groups to be returned to the requester.

The cache is assumed to have already been populated. The group expansion manager executes a single query against the database to obtain the results, which are merged before being returned (there can be multiple entries for a single user, coming from different repositories with different groups).
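
As a rough illustration of the merge step, the sketch below combines several cache rows for one user into a single duplicate-free list. The row layout (repository, username, groups) and the repository names are assumptions; the actual database schema is not described here.

```python
def merge_entries(rows):
    """Merge (repository, username, groups) rows into one ordered list.

    Groups are kept in first-seen order and duplicates are dropped.
    The row layout is hypothetical and chosen only for this sketch.
    """
    merged = []
    seen = set()
    for _repository, _username, groups in rows:
        for group in groups:
            if group not in seen:
                seen.add(group)
                merged.append(group)
    return merged


# Two entries for the same user, coming from different repositories.
rows = [
    ("sharepoint", "tesla", ["scientists", "group1"]),
    ("documentum", "tesla", ["scientists", "group2"]),
]
print(merge_entries(rows))
```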

Group Expansion Manager

The group expansion manager is responsible for four main areas of processing:

Receive external requests
- And return a response when all processing has been done.
Provide workflow components that allow the operator to make changes to the request and response.
Add external group data in to the responses
Query the username in its database cache (populated by the different connectors)

The architecture can be seen below:

Receiving requests

The group expansion manager includes the components for receiving group expansion requests. It publishes a servlet in Aspire with the (default) path /groupExpansion (i.e. http://localhost:50505/groupExpansion). This servlet expects a single parameter (username) and sends a group expansion job into Aspire. Once processing has been performed, the servlet returns a list of the groups to which the user belongs. Thus a call to

http://localhost:50505/groupExpansion?username=tesla

provides a response in the form:

  <groups>
    <group>tesla</group>
    <group>scientists</group>
    <group>italians</group>
    <group>group1</group>
    <group>group2</group>
    <group>group3</group>
    <group>group4</group>
    <group>PUBLIC:ALL</group>
    <group>xxxxxx</group>
  </groups>

Note that the user itself (tesla in the above example) is returned as a pseudo group.
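
A client could call the servlet and read the response along the following lines. This is a sketch using only the Python standard library; the function names are illustrative, and it assumes the response has exactly the XML shape shown above and that Aspire is reachable at the default address.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def parse_groups(xml_text):
    """Extract the group names from a <groups> response document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("group") if element.text]


def fetch_groups(username, base_url="http://localhost:50505/groupExpansion"):
    """Request expansion for `username` and return the list of groups."""
    query = urllib.parse.urlencode({"username": username})
    with urllib.request.urlopen(f"{base_url}?{query}", timeout=5) as response:
        return parse_groups(response.read())


# Parsing a shortened copy of the sample response (no server needed).
sample = """
<groups>
  <group>tesla</group>
  <group>scientists</group>
  <group>PUBLIC:ALL</group>
</groups>
"""
print(parse_groups(sample))
```

With a running server, fetch_groups("tesla") would issue the request shown above. The username is URL-encoded so that names containing characters such as a backslash or a space are sent safely.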

The group expansion manager also includes a “proxy LDAP server”. This is disabled by default and requires pre-installation of another service (the Ldap Group Cache). When enabled, this proxy allows search engines such as the GSA to use Aspire for group expansion. The proxy expects to receive all LDAP requests from the engine. Requests for the groups of a user are intercepted: the username is extracted from the LDAP request and sent as a “group expansion request” job to the same pipeline as the HTTP requests. Once the expansion has been performed, the returned groups are gathered, formatted as an LDAP server response and sent back to the engine.

Requests which are not requests for groups (such as general LDAP searches or login requests) are forwarded to a “real” LDAP server via an LDAP connection component. This LDAP connection component is not installed as part of the group expansion manager and must be configured externally. It is included in the Ldap Group Cache service (see later).

Workflow

The group expansion manager includes a number of workflow processors that let the system administrator manipulate group expansion requests or responses, changing either the user to be looked up or the groups to be returned.

Workflow rules can be added via the UI and are executed at the following points in the process:

Workflow name	Position in process	Usage
onRequest	Immediately after the request is received	Change or modify the user name to be looked up
onResponse	After all expansion has been performed, before the groups are returned back to the requester	Modify the groups returned to the requester

Domain Handling in the Group Expansion Manager

By default, the group expansion manager does not alter the domains of incoming requests or outgoing responses. However, the manager allows the following options for both incoming and outgoing domains:

Option	Description
Leave alone	The username is untouched. If the username has a domain it will be left alone. If it doesn’t, none will be added
Strip	Any domain will be removed from the user or group name
Add	The specified domain name will be added to the user or group name (replacing any existing domain name)
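
The three options can be sketched in Python as below. The text above does not say how a domain is written, so this sketch assumes the two common forms DOMAIN\name and name@domain, and assumes that "Add" produces the DOMAIN\name form. The option identifiers "leave", "strip" and "add" and both function names are hypothetical.

```python
def split_domain(name):
    """Return (domain, bare_name); domain is None when none is present.

    Assumes domains appear as DOMAIN\\name or name@domain.
    """
    if "\\" in name:
        domain, bare = name.split("\\", 1)
        return domain, bare
    if "@" in name:
        bare, domain = name.rsplit("@", 1)
        return domain, bare
    return None, name


def apply_domain_option(name, option, domain=None):
    """Apply one of the options "leave", "strip" or "add" to a name."""
    if option == "leave":
        return name  # untouched, with or without a domain
    _old_domain, bare = split_domain(name)
    if option == "strip":
        return bare
    if option == "add":
        if not domain:
            raise ValueError("the 'add' option needs a domain")
        # Any existing domain is replaced by the configured one.
        return f"{domain}\\{bare}"
    raise ValueError(f"unknown option: {option}")


print(apply_domain_option("CORP\\tesla", "strip"))
print(apply_domain_option("tesla", "add", domain="LAB"))
```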

Supplementary groups

The group expansion manager will allow you to add supplementary groups should you need to. These groups are added after all other expansion has been performed.

PUBLIC:ALL

The group expansion manager by default adds the Aspire “PUBLIC:ALL” group. This group is used by connectors to indicate content that is identified as public. The addition of this can be turned off if required.

Additional Static Groups

The group expansion manager will optionally add static groups. These are any additional groups that you wish to be added to expansion responses. You may configure as many as you wish by specifying the name of each group (including domain if required) in the UI. Note that the groups are added exactly as entered and any domain remains unaltered, even if you have configured domain handling as described above.
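
The effect of the supplementary groups can be sketched as follows. The function and parameter names are hypothetical, and the ordering (PUBLIC:ALL first, then static groups) and the skipping of duplicates are assumptions made for this sketch rather than documented behaviour.

```python
def add_supplementary_groups(groups, static_groups=(), add_public_all=True):
    """Append PUBLIC:ALL (unless disabled) and any static groups.

    Static groups are appended exactly as given; no domain handling
    is applied to them.
    """
    result = list(groups)
    extras = (["PUBLIC:ALL"] if add_public_all else []) + list(static_groups)
    for group in extras:
        if group not in result:  # assumption: duplicates are not repeated
            result.append(group)
    return result


expanded = ["tesla", "scientists"]
print(add_supplementary_groups(expanded, static_groups=["xxxxxx"]))
print(add_supplementary_groups(expanded, add_public_all=False))
```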
