As of Aspire 4.0, Elasticsearch is a supported NoSQL database that can be used to maintain the Crawl State.

The Aspire Elasticsearch Provider is the component responsible for talking to Elasticsearch on behalf of Aspire. All configuration for the Elasticsearch Provider in Aspire is done in the settings.xml file.


The Elasticsearch NoSQL Provider for Aspire requires Elasticsearch 7.x to run. It does not run with previous versions.

Basic Configuration

<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
</noSQLConnectionProvider>

Aspire creates one set of Elasticsearch indexes for each configured content source. When the content source is deleted, its indexes are dropped. Each index name is built from the following parts, joined by hyphens:

  • prefix "aspire-"
  • cluster id defined in settings.xml - e.g. "dev"
  • normalized value of the content source name - e.g. "aspider_web_crawler"
  • provider object name - e.g. "processqueue"

Examples of index names: aspire-dev-aspider_web_crawler-processqueue, aspire-dev-aspider_web_crawler-snapshot, aspire-dev-group_expansion_manager-usersandgroups
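
The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the normalization rule (lowercase, with runs of non-alphanumeric characters replaced by an underscore) is an assumption inferred from the examples, not the provider's documented behavior.

```python
import re


def normalize(name: str) -> str:
    # Assumed rule: lowercase the name and collapse every run of
    # non-alphanumeric characters into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def build_index_name(cluster_id: str, content_source: str, provider_object: str) -> str:
    # prefix + cluster id + normalized content source name + provider object name
    return "-".join(["aspire", cluster_id, normalize(content_source), provider_object])


print(build_index_name("dev", "Aspider Web Crawler", "processqueue"))
```

With the inputs shown, the sketch yields the first example name listed above.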

Authentication

<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
    <authentication type="basic">
        <username>admin</username>
        <password>encrypted:password</password>
    </authentication>
</noSQLConnectionProvider>

The Elasticsearch provider can be configured to use Basic authentication if the Elasticsearch server administrator requires it. Both a username and a password must be provided. The password must be encrypted with the standard Aspire encryption utilities.

Note

Elasticsearch authentication was tested using the file-based authentication provided in the free tier. None of the security features included in the subscription tiers were tested, nor are they officially supported. The security features have been available for free since versions 6.8.0 and 7.1.0. For more information, see Elastic's official notice.


Claim prefetch size

<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
    <claimPrefetch>300</claimPrefetch>
    <claim>100</claim>
</noSQLConnectionProvider>

The Elasticsearch provider can claim items from queues in larger units: after the status of the queue items is changed from "available" to "inProgress", they are sent back to Elasticsearch as a single bulk request, which improves performance. The claim unit sizes can be tuned to the Aspire installation (e.g. standalone or distributed mode). The process works as follows:

  • Prefetch "claimPrefetch" queue items with the status "available". The default value is 10 000.
  • Randomly pick a set of size "claim" from the prefetched set. The default value is 10 000.
  • Change the status of all claimed items from "available" to "inProgress" and send them back to Elasticsearch as a bulk update request.
  • Examine the result of the bulk request. Every item that was updated successfully, with no version conflict, is passed one by one to the connector framework as the result of the bulk claim.
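
The steps above can be sketched with a small in-memory simulation. This is not the provider's implementation: the dictionary stands in for the Elasticsearch queue index, the version counter imitates Elasticsearch's version-conflict check, and all function names are hypothetical.

```python
import random


def prefetch(store, claim_prefetch):
    # Step 1: collect up to claim_prefetch "available" items,
    # remembering the version each item had when it was read.
    found = []
    for item_id, doc in store.items():
        if len(found) >= claim_prefetch:
            break
        if doc["status"] == "available":
            found.append((item_id, doc["version"]))
    return found


def bulk_claim(store, candidates):
    # Steps 3 and 4: try to flip every candidate to "inProgress" and keep
    # only the ones whose version is unchanged (no conflict).
    claimed = []
    for item_id, seen_version in candidates:
        doc = store[item_id]
        if doc["status"] != "available" or doc["version"] != seen_version:
            continue  # another worker got there first
        doc["status"] = "inProgress"
        doc["version"] += 1
        claimed.append(item_id)
    return claimed


def claim_items(store, claim_prefetch=10_000, claim=10_000, rng=random):
    candidates = prefetch(store, claim_prefetch)
    # Step 2: pick a random subset of size "claim" from the prefetched set.
    chosen = rng.sample(candidates, min(claim, len(candidates)))
    return bulk_claim(store, chosen)


queue = {i: {"status": "available", "version": 1} for i in range(10)}
print(len(claim_items(queue, claim_prefetch=6, claim=4)))
```

Picking the subset at random means that workers who prefetched overlapping sets are less likely to try to claim the same items, so fewer updates are lost to version conflicts.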

Keep search context alive

<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
    <keepSearchContextAlive>5m</keepSearchContextAlive>
</noSQLConnectionProvider>

The Elasticsearch provider's iterators use the Elasticsearch scroll technique. Scrolls are resources that should be deleted after use. This is done explicitly whenever possible, either by calling the iterator's close method or when the iteration finishes. In some cases, however, the iteration cannot be completed; the unused scrolls then persist and could exhaust the available resources. The "keepSearchContextAlive" parameter controls how long a scroll is kept before it is deleted. The default value is "5m". The format of this parameter and other information is described in the Elasticsearch documentation.
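
As a rough illustration of the value format, the sketch below converts a keep-alive string into seconds. It assumes the parameter follows Elasticsearch's time-unit notation (a whole number followed by d, h, m, s, ms, micros or nanos); the function name is hypothetical and this is not how the provider itself handles the value.

```python
import re

# Seconds per Elasticsearch time unit.
_UNIT_SECONDS = {
    "d": 86400,
    "h": 3600,
    "m": 60,
    "s": 1,
    "ms": 1e-3,
    "micros": 1e-6,
    "nanos": 1e-9,
}


def keep_alive_seconds(value: str):
    # Accept a whole number immediately followed by one known unit.
    match = re.fullmatch(r"(\d+)(d|h|m|s|ms|micros|nanos)", value.strip())
    if match is None:
        raise ValueError(f"not a valid time value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(keep_alive_seconds("5m"))
```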

Retries Settings

<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
    <maxRetries>3</maxRetries>
</noSQLConnectionProvider>

The Provider automatically retries operations that could not be completed because of errors. The maximum number of retries is configurable with the "maxRetries" option. The default value is "5".
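
The retry behavior can be pictured with the following minimal sketch. It assumes that "maxRetries" counts the retries made after the first failed attempt and that retries happen immediately with no delay; neither detail is stated above, and the helper name is hypothetical.

```python
def with_retries(operation, max_retries=5):
    # Run the operation; on failure, retry up to max_retries more times
    # (assumption: the first attempt is not counted as a retry).
    retries_done = 0
    while True:
        try:
            return operation()
        except Exception:
            if retries_done >= max_retries:
                raise  # give up and surface the last error
            retries_done += 1


attempts = []


def flaky():
    # Fails twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(with_retries(flaky, max_retries=3))
```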

Bulk Settings

<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
    <useBulk>true</useBulk>
    <bulkSize>100</bulkSize>
    <bulkTimeout>30s</bulkTimeout>
    <bulkWaitTimeout>5m</bulkWaitTimeout>
</noSQLConnectionProvider>

If "useBulk" is true (the default is false), the Provider sends selected operations in bulk for better performance. The values shown for the remaining parameters in the example above are also their defaults.

Profiling Settings

<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
    <debugOutFile>/tmp/aspire/profile.log</debugOutFile>
</noSQLConnectionProvider>

The Provider will log profiling messages to the specified file.

Full configuration

<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
    
    <claimPrefetch>300</claimPrefetch>
    <claim>100</claim>

    <keepSearchContextAlive>5m</keepSearchContextAlive>

    <authentication type="basic">
        <username>admin</username>
        <password>encrypted:password</password>
    </authentication>

    <maxRetries>3</maxRetries>

    <useBulk>true</useBulk>
    <bulkSize>100</bulkSize>
    <bulkTimeout>30s</bulkTimeout>
    <bulkWaitTimeout>5m</bulkWaitTimeout>

    <debugOutFile>/tmp/aspire/profile.log</debugOutFile>
</noSQLConnectionProvider>
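
To see which values apply when an option is omitted, the sketch below reads a provider block and fills in the defaults stated on this page. It is an illustration only, not how Aspire reads settings.xml; the function name is hypothetical and nested elements such as "authentication" are skipped.

```python
import xml.etree.ElementTree as ET

# Defaults as documented in the sections above.
DEFAULTS = {
    "claimPrefetch": "10000",
    "claim": "10000",
    "keepSearchContextAlive": "5m",
    "maxRetries": "5",
    "useBulk": "false",
    "bulkSize": "100",
    "bulkTimeout": "30s",
    "bulkWaitTimeout": "5m",
}


def effective_settings(xml_text):
    root = ET.fromstring(xml_text)
    settings = dict(DEFAULTS)
    for child in root:
        if len(child) == 0:  # simple option, no nested elements
            settings[child.tag] = (child.text or "").strip()
    return settings


example = """
<noSQLConnectionProvider>
    <implementation>com.accenture.aspire:aspire-elasticsearch-provider</implementation>
    <url>http://localhost:9200</url>
    <maxRetries>3</maxRetries>
</noSQLConnectionProvider>
"""
for name, value in effective_settings(example).items():
    print(name, "=", value)
```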