The Publish to CDH HDFS publisher posts documents, in the form of Key/AspireObject entries, to a specific folder in HDFS using either the HDFS API or the WebHDFS HTTP interface.
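
As a rough illustration of the WebHDFS route, the sketch below builds the URL for a WebHDFS `CREATE` request. The `/webhdfs/v1` prefix and the `op=CREATE` and `overwrite` parameters belong to the standard WebHDFS REST API. The host name, port, and output path are hypothetical placeholders, and the sketch does not claim to show how the publisher itself issues the request.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, overwrite=True):
    """Build the NameNode URL for a WebHDFS CREATE request.

    hdfs_path must be an absolute HDFS path (starting with "/").
    """
    # WebHDFS expects lowercase "true"/"false" for boolean parameters.
    query = urlencode({"op": "CREATE", "overwrite": str(overwrite).lower()})
    # quote() percent-encodes unsafe characters but keeps "/" separators.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical host, port, and folder, for illustration only.
print(webhdfs_create_url("namenode.example.com", 50070, "/aspire/output/part-00000"))
```

In WebHDFS, file creation is a two-step exchange: a `PUT` to this URL with no body is answered by a redirect to a DataNode, and the file content is then sent with a second `PUT` to the redirect location.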

Features

Some of the features of the Publish to CDH HDFS publisher include:

  • Works with CDH 5.
  • Runs on any machine with access to the HDFS cluster (Windows and Linux).
  • The output key can be defined from an entry of the AspireObject of each document.
  • AspireObjects are serialized/deserialized as JSON in the HDFS files.
  • Output file size can be customized to take advantage of HDFS block sizes. Limiting the file size also makes it easier to move the smaller files of a single collection.
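
The key selection, JSON serialization, and file-size features above can be sketched as follows. This is a minimal illustration under stated assumptions, not the publisher's implementation: the tab-separated "key, then JSON" line layout, the function names, and the `key_field` and `max_bytes` parameters are all hypothetical, and plain dictionaries stand in for AspireObjects.

```python
import json


def serialize_entry(key, doc):
    """Render one key/document pair as a single line (assumed layout: key, tab, JSON)."""
    return f"{key}\t{json.dumps(doc, sort_keys=True)}\n"


def batch_entries(docs, key_field, max_bytes):
    """Group serialized entries into file-sized batches.

    Each batch stays at or under max_bytes when UTF-8 encoded, except that a
    single entry larger than max_bytes is still emitted as its own batch.
    """
    batches, current, size = [], [], 0
    for doc in docs:
        # The output key is taken from a field of the document itself.
        line = serialize_entry(doc[key_field], doc)
        line_size = len(line.encode("utf-8"))
        # Start a new batch when adding this line would exceed the limit.
        if current and size + line_size > max_bytes:
            batches.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        batches.append("".join(current))
    return batches


def parse_entry(line):
    """Recover the key and the document from one serialized line."""
    key, payload = line.rstrip("\n").split("\t", 1)
    return key, json.loads(payload)


docs = [
    {"id": "a", "title": "x"},
    {"id": "b", "title": "y"},
    {"id": "c", "title": "z"},
]
for i, batch in enumerate(batch_entries(docs, key_field="id", max_bytes=60)):
    print(f"file {i}: {len(batch.encode('utf-8'))} bytes")
```

In practice the size limit would be chosen relative to the cluster's HDFS block size. The 60-byte limit here is only small enough to force a second file in the example.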