As of Aspire 3.3, HBase is a supported NoSQL database that can be used to maintain the Crawl State.

HBase is the default choice for installations of the Aspire Parcel and Service for Cloudera, since most Cloudera Hadoop distributions already provide an HBase service that Aspire can use.

The Aspire HBase Provider is the component that is responsible for talking to HBase on behalf of Aspire. All configuration for the HBase Provider in Aspire is done in the settings.xml file.

Basic Configuration Example

To connect to an unsecured HBase database, the ZooKeeper quorum is required. This is the list of ZooKeeper servers used to locate the HBase instances.


  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
    </properties>
  </noSQLConnectionProvider>
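
In a multi-node cluster the quorum usually lists more than one server. As a sketch, the standard HBase property "hbase.zookeeper.quorum" accepts a comma-separated list of hosts; the host names below are hypothetical placeholders, not values from a real installation.

  <!-- Sketch: quorum with several ZooKeeper hosts (host names are placeholders) -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <properties>
      <property name="hbase.zookeeper.quorum">zk1.example.com,zk2.example.com,zk3.example.com</property>
    </properties>
  </noSQLConnectionProvider>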

Namespace Prefix

Aspire will create one namespace per content source. Under each namespace, all of the necessary tables will be created. Each namespace will match the name of the content source with a default "aspire_" prefix.

For example, a content source named "Sharepoint Documents" gets the namespace "aspire_Sharepoint_Documents". The prefix can be changed by adding the "namespacePrefix" field to the configuration.

Retry Settings

The Provider automatically retries operations that could not be completed because of connection errors. The maximum number of retries is configurable with the "maxRetries" option. By default (if nothing is provided), up to five retries are attempted.

Create Namespaces Option

By default, the Provider will try to create the namespaces if they don't exist on HBase. Sometimes the HBase system is configured so that users cannot create or delete namespaces, but are allowed to create tables in particular pre-existing namespaces. To prevent Aspire from trying to create namespaces, set the "createNamespaces" option to "false".

If the option is turned off, make sure the namespaces are created before starting the Aspire nodes.
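
A minimal sketch of a configuration with namespace creation turned off, assuming the required namespaces already exist on HBase. The element placement follows the other examples on this page, and the quorum host is a placeholder.

  <!-- Sketch: the Provider will not attempt to create namespaces -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <createNamespaces>false</createNamespaces>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
    </properties>
  </noSQLConnectionProvider>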


Complete Example

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <namespacePrefix>aspire_crawl_</namespacePrefix>
    <maxRetries>10</maxRetries>
    <createNamespaces>true</createNamespaces>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
    </properties>
  </noSQLConnectionProvider>

Configuring Non-Default Hadoop Cluster Parameters


If your Hadoop cluster uses non-default settings for things like the ZooKeeper root path, the HDFS root directory, or the HBase ports, you can set all of them in the "properties" section of the settings file. This section mirrors the properties of the Hadoop configuration files, so the same property names can be used here.

Example

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
      <property name="hbase.rootdir">hdfs://example0:9025</property>
    </properties>
  </noSQLConnectionProvider>
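
Other cluster-specific values can be set the same way. The sketch below uses two standard HBase property names, "zookeeper.znode.parent" (the ZooKeeper root path) and "hbase.zookeeper.property.clientPort" (the ZooKeeper client port). The values shown are hypothetical placeholders and must be replaced with the ones used by your cluster.

  <!-- Sketch: non-default ZooKeeper root path and client port (values are placeholders) -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
      <property name="zookeeper.znode.parent">/hbase-unsecure</property>
      <property name="hbase.zookeeper.property.clientPort">2182</property>
    </properties>
  </noSQLConnectionProvider>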

Configuring Using Hadoop Config Files


If you are running Aspire inside a Hadoop cluster, you can use the Hadoop configuration files to connect to HBase with the same configuration properties as the rest of the cluster. To do so, you need to determine where the following files are located.

  • core-site.xml / hdfs-site.xml

Contain information about where the NameNode runs in the cluster, along with Hadoop Core configuration settings, such as I/O settings, that are common to HDFS and MapReduce.

  • hbase-site.xml

Contains the ZooKeeper quorum to use, the HBase root directory on HDFS, and the ZooKeeper root path.
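
For orientation, an hbase-site.xml file typically looks like the sketch below. It uses the standard Hadoop configuration layout of name/value pairs. All values are placeholders for illustration and are not taken from a real cluster.

  <!-- Sketch of an hbase-site.xml (values are placeholders) -->
  <configuration>
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zookeeper-server</value>
    </property>
    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://example0:9025</value>
    </property>
    <property>
      <name>zookeeper.znode.parent</name>
      <value>/hbase</value>
    </property>
  </configuration>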

Aspire needs a folder that contains these files and is readable by the user running the Aspire process.

Example: if the files are located under "/etc/hbase/conf.cloudera.hbase/", the configuration is:

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <configDir>/etc/hbase/conf.cloudera.hbase</configDir>
  </noSQLConnectionProvider>

Security Settings


The HBase Provider is able to connect to secured HBase databases using Kerberos. It only needs a user principal and a keytab file to authenticate with.


  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <configDir>/etc/hbase/conf.cloudera.hbase</configDir>
    <security>
      <kerberos>
        <user>principal@REALM</user>
        <path>/path/to/clusteruser.keytab</path>
      </kerberos>
    </security>
  </noSQLConnectionProvider>

Felix Properties Requirements


If Kerberos authentication is going to be used, add the following lines to the felix.properties file before launching Aspire.

# To append packages to the default set of exported system packages,
# set this value.
org.osgi.framework.system.packages.extra=\
 ...
 sun.security.krb5, \
 com.sun.security.auth.callback

# The following property makes specified packages from the class path
# available to all bundles. You should avoid using this property.
org.osgi.framework.bootdelegation=\
 ...
 javax.security.sasl, \
 sun.security.krb5

