The Hadoop Distributed File System (HDFS) connector crawls content from a given HDFS cluster using the WebHDFS HTTP interface.

Features


Features of the HDFS connector include:

  • Performs incremental crawling, so that only new or updated documents are indexed
  • Extracts metadata
  • Is search engine independent
  • Runs from any machine with HTTP access to the given HDFS NameNode
  • Filters the crawled documents by path (including file names) using regex patterns (see the sketch after this list)
  • Supports Kerberized clusters by using a delegation token
  • Supports archive file processing; for more information, see Archive files processing
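
Path filtering of this kind is typically driven by include/exclude regular expressions evaluated against the full HDFS path. The following is a minimal sketch of that idea; the patterns and the shouldCrawl helper are hypothetical illustrations, not the connector's actual configuration keys.

    import java.util.List;
    import java.util.regex.Pattern;

    public class PathFilterSketch {

        // Hypothetical patterns: crawl only .csv and .log files under /data,
        // and skip anything inside temporary directories.
        private static final List<Pattern> INCLUDES = List.of(
                Pattern.compile("^/data/.*\\.(csv|log)$"));
        private static final List<Pattern> EXCLUDES = List.of(
                Pattern.compile(".*/(tmp|_temporary)/.*"));

        // A path is crawled if it matches at least one include pattern
        // and no exclude pattern.
        static boolean shouldCrawl(String hdfsPath) {
            boolean included = INCLUDES.stream().anyMatch(p -> p.matcher(hdfsPath).matches());
            boolean excluded = EXCLUDES.stream().anyMatch(p -> p.matcher(hdfsPath).matches());
            return included && !excluded;
        }

        public static void main(String[] args) {
            System.out.println(shouldCrawl("/data/2024/events.csv"));   // true
            System.out.println(shouldCrawl("/data/tmp/events.csv"));    // false
            System.out.println(shouldCrawl("/user/alice/notes.txt"));   // false
        }
    }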

WebHDFS Operations


The connector uses only two WebHDFS operations:

http://<host>:<port>/webhdfs/v1/<path>?op=OPEN

  • Used to fetch the file data from which the document content is extracted (see the sketch below).

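As a rough illustration, the OPEN call can be issued with any HTTP client; WebHDFS first answers with a redirect to a DataNode, which then streams the file bytes. Below is a minimal sketch using Java's built-in HTTP client, with a placeholder NameNode host, port, and file path.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WebHdfsOpenSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder NameNode host/port and file path.
            String url = "http://namenode.example.com:9870/webhdfs/v1/data/sample.txt?op=OPEN";

            // OPEN responds with a redirect to the DataNode that serves the bytes,
            // so the client must follow redirects.
            HttpClient client = HttpClient.newBuilder()
                    .followRedirects(HttpClient.Redirect.ALWAYS)
                    .build();

            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            // The body holds the raw file content, ready for content extraction.
            System.out.println("HTTP " + response.statusCode()
                    + ", " + response.body().length + " bytes fetched");
        }
    }
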
http://<host>:<port>/webhdfs/v1/<path>?op=LISTSTATUS

  • Used to scan a directory and retrieve relevant file information (see the sketch after this list), such as:
    • Last-modified dates for incremental crawls
    • Group and owner
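
For reference, a LISTSTATUS call returns a JSON document whose FileStatuses.FileStatus entries carry fields such as pathSuffix, type, modificationTime, owner, and group. The sketch below simply fetches and prints that JSON; the host, port, and directory path are placeholders.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WebHdfsListStatusSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder NameNode host/port and directory path.
            String url = "http://namenode.example.com:9870/webhdfs/v1/data?op=LISTSTATUS";

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // The JSON body contains a FileStatuses.FileStatus array; each entry
            // exposes pathSuffix, type, modificationTime, owner, and group, which
            // is what an incremental crawl needs to detect new or updated files.
            System.out.println(response.body());
        }
    }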