The Load HDFS stage retrieves the AspireObject associated with a key from HDFS and sets it on the incoming AspireObject. When the stage is initialized, it stores a MapDB cache of the key/value pairs held in HDFS, which makes later retrievals faster.

Communication with HDFS goes through the FileSystem methods of the HDFS API.

Configuration

| Element | Type | Default | Description |
|---|---|---|---|
| hdfsLocation | String | hdfs://localhost:8020 | The HDFS Namenode URL. |
| folderPath | String | | The path within the HDFS server from which the files are retrieved. |
| dbFile | String | data/${app.bundle.name}/jobs.mapdb | The path where the MapDB files are stored. |
| warmOnStartup | boolean | false | Whether to warm the MapDB cache on initialization. |
| keyField | String | hdfsKey | The field of the incoming AspireObject that holds the key to retrieve. |
| outputField | String | hdfsValue | The field of the incoming AspireObject in which the retrieved AspireObject is stored. |
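
As a sketch of how the defaults might apply: assuming that omitted elements fall back to the defaults listed above (this page does not state so explicitly), a minimal configuration would only need to set folderPath, which has no default. The component attributes are copied from the example on this page, and the folder path is illustrative.

```xml
<!-- Minimal sketch; assumes omitted elements take their documented defaults -->
<component name="PostHDFS" subType="default" factoryName="aspire-hadoop-hdfs">
  <!-- folderPath has no default, so it is set explicitly (illustrative path) -->
  <folderPath>/user/jsmith/test/</folderPath>
</component>
```

Under that assumption, the stage would connect to hdfs://localhost:8020, read the key from hdfsKey, write the result to hdfsValue, and leave the cache unwarmed at startup.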


Example

This section provides an example Load HDFS configuration for a local HDFS server.

<component name="PostHDFS" subType="default" factoryName="aspire-hadoop-hdfs">
  <hdfsLocation>hdfs://localhost:8020/</hdfsLocation>
  <folderPath>/user/jsmith/test/</folderPath> 
  <warmOnStartup>true</warmOnStartup>
  <dbFile>jobs.mapdb</dbFile>
  <keyField>hdfsKey</keyField>
  <outputField>hdfsValue</outputField>
</component>
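
With this configuration, the stage reads the key from the hdfsKey field of the incoming document. A plausible incoming document, assuming the key Monday, might look like the following. It is an illustrative sketch, not captured from a real job.

```xml
<!-- Hypothetical incoming document: only the key field is populated -->
<doc>
  <hdfsKey>Monday</hdfsKey>
</doc>
```

After the stage runs, the object retrieved for that key is placed under the hdfsValue field.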


Output

<doc>
  <hdfsKey>Monday</hdfsKey>
  <hdfsValue>
    <doc>
       <weekDay>Monday</weekDay>
       <name>jsmith</name>
       <date>2013/07/16</date>
       <url>http://www.searctechnologies.com/products/we-are-great.html</url>
    </doc>
  </hdfsValue>
</doc>