The Azure Data Lake Connector crawls content from an Azure Data Lake Storage Gen2 account, either across all file systems or from a specified file system and paths.
Introduction
Azure Data Lake Storage is a comprehensive, scalable, and cost-effective data lake solution designed for big data analytics, and it provides a hierarchical file system. It combines the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage.
For more information about Azure Data Lake Storage Gen2, see Microsoft's official Overview of Azure Data Lake Storage Gen2 documentation.
Environment and Access Requirements
Repository Support
The Azure Data Lake Connector supports crawling the following repository:

| Repository | Version | Connector Version |
| --- | --- | --- |
| Azure Data Lake Storage | Gen 2 | 5.1 |
Environment Requirements
Before installing the Azure Data Lake connector, make sure that:
- You have created the necessary service-to-service application account with the required access to your Data Lake.
- The Azure Data Lake is up and running.
- You have admin rights to grant Read and Execute permissions on the folders to crawl.
User Account Requirements
To access the Azure Data Lake, an application account with sufficient privileges must be supplied. Configure the following fields to set up a new Data Lake connection:
- Account Name (Storage Account Name)
- Application ID
- Application Secret (Application Key)
- Tenant ID
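The four fields above feed the connector's service-to-service authentication. As an illustration, the sketch below builds the Azure AD v2.0 client-credentials token request that such a connection would issue; the tenant, application, and secret values are placeholders, not real credentials.

```python
from urllib.parse import urlencode

def build_token_request(tenant_id: str, app_id: str, app_secret: str):
    """Build the Azure AD v2.0 client-credentials token request used
    for service-to-service authentication against Azure Storage."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": app_id,            # Application ID
        "client_secret": app_secret,    # Application Secret (Application Key)
        "scope": "https://storage.azure.com/.default",
    })
    return url, body

# Placeholder values -- substitute your own Tenant ID, Application ID, and key.
url, body = build_token_request("my-tenant-id", "my-app-id", "my-app-secret")
```

POSTing that body to the returned URL yields the bearer token the connector presents to the storage endpoint.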
Get an Application Account
- See Microsoft's Use portal to create an Azure Active Directory application and service principal that can access resources for the steps to create an Application ID, its key (Client Secret), and Tenant ID. Write down your Application Key at the time of creation; it will not be shown again after you exit the portal. Important: grant the necessary Reader access to your application. See Microsoft's Assign an Azure Role documentation.
- Grant at least Read and Execute access to the files and folders to crawl. See Microsoft's Manage Access Control documentation. To recursively apply the same parent folder permissions to sub-folders, see the "Apply an ACL recursively" section.
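Data Lake Storage Gen2 expresses these permissions as POSIX-style ACL entries (e.g. `user:<object-id>:r-x`), as described in Microsoft's Manage Access Control documentation. The helper below is an illustrative check, not part of the connector, that verifies a named-user entry grants the Read and Execute bits the crawler needs; the object ID used is hypothetical.

```python
def has_read_execute(acl: str, principal_oid: str) -> bool:
    """Check whether a named-user ACL entry for the given object ID
    grants at least Read (r) and Execute (x) permissions."""
    for entry in acl.split(","):
        parts = entry.split(":")
        # Named-user entries look like "user:<object-id>:<perms>".
        if len(parts) == 3 and parts[0] == "user" and parts[1] == principal_oid:
            perms = parts[2]
            return "r" in perms and "x" in perms
    return False

# Hypothetical ACL string for a folder, with one named-user entry.
acl = "user::rwx,group::r-x,other::---,user:00000000-aaaa:r-x"
ok = has_read_execute(acl, "00000000-aaaa")
```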
Framework and Connector Features
Framework Features
| Name | Supported |
| --- | --- |
| Content Crawling | Yes |
| Identity Crawling | Yes |
| Snapshot-based Incrementals | Yes |
| Non-snapshot-based Incrementals | No |
| Document Hierarchy | Yes |
Connector Features
The Azure Data Lake connector has the following features:
- Performs incremental crawling (so that only new/updated documents are indexed)
- Fetches Object ACLs (Access Control Lists) for Azure document-level security
- Runs from any machine with access to the given Data Lake source
- Service-to-Service Authentication via OAuth 2.0 token
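Crawling over the storage account uses the Data Lake Storage Gen2 REST API on the `dfs.core.windows.net` endpoint. As a rough sketch of the requests involved, the functions below build the URLs for the Filesystem - List and Path - List operations a crawl of all file systems or a single file system would use; `mystorageacct` is a placeholder account name.

```python
def list_filesystems_url(account: str) -> str:
    # Filesystem - List operation: enumerates all file systems in the account.
    return f"https://{account}.dfs.core.windows.net/?resource=account"

def list_paths_url(account: str, filesystem: str, recursive: bool = True) -> str:
    # Path - List operation; recursive=true walks folders and files
    # under the file system in a single listing.
    rec = "true" if recursive else "false"
    return (f"https://{account}.dfs.core.windows.net/{filesystem}"
            f"?resource=filesystem&recursive={rec}")

fs_url = list_filesystems_url("mystorageacct")
paths_url = list_paths_url("mystorageacct", "myfilesystem")
```

Requests to these URLs carry the OAuth 2.0 bearer token obtained during service-to-service authentication.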
Content Crawled
The Azure Data Lake connector is able to crawl the following objects:
| Name | Type | Relevant Metadata | Content Fetch and Extraction | Description |
| --- | --- | --- | --- | --- |
| File System | container | | N/A | Contains folders and files |
| Folders | container | | N/A | The directories of the files. Each directory is scanned to retrieve further subfolders or documents |
| Files | document | | Yes | Files stored in folders/subfolders |
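The container/document hierarchy above (file systems containing folders containing files) can be illustrated with a small recursive walk over an in-memory mock. The tree below is hypothetical and simply stands in for a live Data Lake listing; it is not the connector's actual implementation.

```python
def walk(tree: dict, path: str = ""):
    """Yield (path, kind) for each container and document, depth-first,
    mirroring how a crawl descends file systems and folders."""
    for name, node in tree.items():
        full = f"{path}/{name}"
        if isinstance(node, dict):       # file system or folder -> container
            yield full, "container"
            yield from walk(node, full)  # scan for subfolders and documents
        else:                            # file -> document
            yield full, "document"

# Hypothetical listing: one file system, one folder, two files.
lake = {"fs1": {"reports": {"q1.csv": None, "q2.csv": None}}}
items = list(walk(lake))
```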