As of Aspire 3.2, HBase is one of the supported NoSQL databases that can be used to maintain the crawl state.

It is the default choice for Aspire Parcel and Service for Cloudera installations, since most Cloudera Hadoop distributions already provide an HBase service that Aspire can use.

The Aspire HBase Provider is the component that is responsible for talking to HBase on behalf of Aspire. All configuration for the HBase Provider in Aspire is done in the settings.xml file.

Basic Configuration Example

To connect to an unsecured HBase database, the ZooKeeper quorum is required. This is the list of ZooKeeper servers used to locate the HBase instances:


  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
    </properties>
  </noSQLConnectionProvider>

Namespace prefix

Aspire creates one namespace per content source and creates all of the necessary tables under that namespace. Each namespace is named after its content source, with "aspire_" as the default prefix.

For example, if you have a content source named "Sharepoint Documents", the namespace will be "aspire_Sharepoint_Documents". This prefix can be changed by adding the "namespacePrefix" field to the configuration.
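
A minimal sketch of that setting, reusing the quorum from the basic example. The prefix value "aspire_crawl_" is only illustrative:

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <!-- Illustrative value; replaces the default "aspire_" prefix -->
    <namespacePrefix>aspire_crawl_</namespacePrefix>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
    </properties>
  </noSQLConnectionProvider>

Assuming the naming rule described above, the "Sharepoint Documents" content source would then map to the namespace "aspire_crawl_Sharepoint_Documents".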

Retries settings

The provider automatically retries operations that could not be completed because of connection errors. The maximum number of retries is configurable with the "maxRetries" option. If no value is provided, up to 5 retries are executed.
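
As a sketch, raising the limit could look like the following. The value 10 is an arbitrary illustration, not a recommendation:

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <!-- Illustrative value; when omitted, the default of 5 applies -->
    <maxRetries>10</maxRetries>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
    </properties>
  </noSQLConnectionProvider>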

Create Namespaces option

By default, the provider tries to create the namespaces if they do not exist in HBase. Some HBase systems, however, are configured so that users cannot create or delete namespaces and are only granted permission to create tables in particular pre-existing namespaces. To prevent Aspire from trying to create namespaces, set the "createNamespaces" option to "false". If this option is turned off, you must make sure the namespaces already exist before starting the Aspire nodes.
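
A minimal sketch with namespace creation disabled. The namespace named in the comment is taken from the earlier "Sharepoint Documents" example and is assumed to have been created in advance by an HBase administrator:

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <!-- Aspire will not create namespaces; e.g. "aspire_Sharepoint_Documents" must already exist -->
    <createNamespaces>false</createNamespaces>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
    </properties>
  </noSQLConnectionProvider>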

Complete example with all of the options above:

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <namespacePrefix>aspire_crawl_</namespacePrefix>
    <maxRetries>10</maxRetries>
    <createNamespaces>false</createNamespaces>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
    </properties>
  </noSQLConnectionProvider>

Configuring non-default Hadoop cluster parameters


If your Hadoop cluster uses non-default settings for things like the ZooKeeper root path, the HDFS root directory, or the HBase ports, you can configure all of those properties in the "properties" section of the settings file. This section mirrors the properties of the Hadoop configuration files, so you can add the same properties here.

Example:

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
      <property name="hbase.rootdir">hdfs://example0:9025</property>
    </properties>
  </noSQLConnectionProvider>
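
As a further sketch, the ZooKeeper root path and client port mentioned above correspond to the standard HBase properties "zookeeper.znode.parent" and "hbase.zookeeper.property.clientPort". The values shown are hypothetical and must match your cluster:

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <properties>
      <property name="hbase.zookeeper.quorum">zookeeper-server</property>
      <!-- Hypothetical values; the HBase defaults are /hbase and 2181 -->
      <property name="zookeeper.znode.parent">/hbase-unsecure</property>
      <property name="hbase.zookeeper.property.clientPort">2182</property>
    </properties>
  </noSQLConnectionProvider>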

Configuring using Hadoop config files


If you are running Aspire inside a Hadoop cluster, you can use the Hadoop configuration files to connect to HBase with the same configuration properties as the rest of the cluster. To do so, locate the following files:

  • core-site.xml / hdfs-site.xml

Contains information about where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core such as I/O settings that are common to HDFS and MapReduce.

  • hbase-site.xml

Contains the ZooKeeper quorum to use, the HBase root directory on HDFS, and the ZooKeeper root path.

These files must be in a folder that is readable by the user running the Aspire process.

Example, if the files are located under "/etc/hbase/conf.cloudera.hbase/":

  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <configDir>/etc/hbase/conf.cloudera.hbase</configDir>
  </noSQLConnectionProvider>
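
For orientation, a minimal hbase-site.xml in that folder might look like the sketch below. It uses the standard Hadoop configuration file layout. The values are placeholders borrowed from the earlier examples, and a real Cloudera-managed file will contain many more entries:

  <!-- Sketch of hbase-site.xml; values are placeholders, not taken from a real cluster -->
  <configuration>
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zookeeper-server</value>
    </property>
    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://example0:9025</value>
    </property>
    <property>
      <name>zookeeper.znode.parent</name>
      <value>/hbase</value>
    </property>
  </configuration>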

Security Settings


The HBase Provider is able to connect to secured HBase databases using Kerberos. It only needs a user principal and a keytab file to authenticate with.


  <!-- noSql database provider for the 3.X connector framework -->
  <noSQLConnectionProvider>
    <implementation>com.searchtechnologies.aspire:aspire-hbase-provider</implementation>
    <configDir>/etc/hbase/conf.cloudera.hbase</configDir>
    <security>
      <kerberos>
        <user>[email protected]</user>
        <path>/path/to/clusteruser.keytab</path>
      </kerberos>
    </security>
  </noSQLConnectionProvider>
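
For context, the hbase-site.xml of a Kerberos-secured cluster typically carries entries like the ones sketched below. These are standard HBase properties rather than Aspire settings. The realm EXAMPLE.COM is a placeholder, and it is an assumption here that the provider reads them from the folder given in "configDir":

  <!-- Sketch of the security-related part of hbase-site.xml; the realm is a placeholder -->
  <configuration>
    <property>
      <name>hbase.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hbase.master.kerberos.principal</name>
      <value>hbase/_HOST@EXAMPLE.COM</value>
    </property>
    <property>
      <name>hbase.regionserver.kerberos.principal</name>
      <value>hbase/_HOST@EXAMPLE.COM</value>
    </property>
  </configuration>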

