The Hadoop Distributed File System (HDFS) connector crawls content from a given HDFS cluster using the WebHDFS HTTP interface.
Features
Some of the features of the HDFS connector include:
- Performs incremental crawling (so that only new or updated documents are indexed)
- Extracts metadata
- Is search engine independent
- Runs from any machine with HTTP access to the given HDFS NameNode
- Filters the crawled documents by path, including file names, using regex patterns (see the sketch after this list)
- Supports Kerberized clusters by using a delegation token
- Supports archive file processing; for more information, see Archive files processing
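To illustrate the regex path filtering mentioned above, here is a minimal sketch that checks candidate HDFS paths against include and exclude patterns. The PathFilter class, its example patterns, and the shouldCrawl helper are hypothetical, shown only to clarify the matching logic; the connector's actual configuration keys and semantics may differ.

```java
import java.util.List;
import java.util.regex.Pattern;

public class PathFilter {
    // Hypothetical example patterns; not the connector's actual configuration.
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    public PathFilter(List<Pattern> includes, List<Pattern> excludes) {
        this.includes = includes;
        this.excludes = excludes;
    }

    /** A path is crawled if it matches at least one include pattern and no exclude pattern. */
    public boolean shouldCrawl(String path) {
        boolean included = includes.isEmpty()
                || includes.stream().anyMatch(p -> p.matcher(path).matches());
        boolean excluded = excludes.stream().anyMatch(p -> p.matcher(path).matches());
        return included && !excluded;
    }

    public static void main(String[] args) {
        PathFilter filter = new PathFilter(
                List.of(Pattern.compile(".*\\.pdf"), Pattern.compile("/data/reports/.*")),
                List.of(Pattern.compile(".*/tmp/.*")));
        System.out.println(filter.shouldCrawl("/data/reports/q1.pdf")); // true
        System.out.println(filter.shouldCrawl("/data/tmp/x.pdf"));      // false (excluded)
    }
}
```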
WebHDFS Operations
This connector uses only two WebHDFS operations:
http://<host>:<port>/webhdfs/v1/<path>?op=OPEN
- Used to fetch a file's data so that its content can be extracted.
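To make the flow concrete, here is a minimal sketch of the OPEN call using Java's built-in HTTP client. The host, port, and file path are placeholders (9870 is the default NameNode HTTP port in Hadoop 3); WebHDFS answers an OPEN request with a 307 redirect to the datanode that holds the data, so the client must follow redirects.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WebHdfsOpen {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode address and file path.
        String url = "http://namenode.example.com:9870/webhdfs/v1/data/reports/q1.pdf?op=OPEN";

        // OPEN first returns a 307 redirect to the datanode holding the data,
        // so the client must be configured to follow redirects.
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.ALWAYS)
                .build();

        HttpResponse<byte[]> response = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofByteArray());

        System.out.println("Fetched " + response.body().length + " bytes");
    }
}
```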
http://<host>:<port>/webhdfs/v1/<path>?op=LISTSTATUS
- Used to scan a directory and retrieve relevant file information, such as:
- Last-modified dates for incremental crawls.
- Group and owner.
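A minimal sketch of the directory scan side, assuming the same placeholder namenode address and the Jackson library for JSON parsing: LISTSTATUS returns a FileStatuses JSON object whose entries carry modificationTime (epoch milliseconds), owner, group, and pathSuffix fields, which is what the incremental crawl compares against. The lastCrawlMillis cutoff below is a hypothetical stand-in for the connector's stored crawl state.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class WebHdfsListStatus {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode address and directory path.
        String url = "http://namenode.example.com:9870/webhdfs/v1/data/reports?op=LISTSTATUS";
        long lastCrawlMillis = 1700000000000L; // hypothetical timestamp of the previous crawl

        HttpClient client = HttpClient.newHttpClient();
        String json = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString()).body();

        // LISTSTATUS returns {"FileStatuses":{"FileStatus":[...]}} where each entry
        // carries modificationTime (epoch millis), owner, group, pathSuffix, and type.
        JsonNode statuses = new ObjectMapper().readTree(json)
                .path("FileStatuses").path("FileStatus");

        for (JsonNode status : statuses) {
            long modified = status.path("modificationTime").asLong();
            if (modified > lastCrawlMillis) { // only re-index new or updated entries
                System.out.printf("changed: %s (owner=%s, group=%s)%n",
                        status.path("pathSuffix").asText(),
                        status.path("owner").asText(),
                        status.path("group").asText());
            }
        }
    }
}
```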