Azure Data Lake Source

The connector will crawl folders and files (configuration dependent). Execution will result in the following fields being populated:
Property | Type | Description
Fullname | String | Full path of the directory or file (from root path "/")
Name | String | File name (minus the path) of the directory or file
Length | Long | Length of a file (does not apply to directories)
Group | String | ID of the group that owns this file/directory
User | String | ID of the user that owns this file/directory
Permission | String | Unix-style permission string for this file or directory
Last Access Time | Date | Date and time the file was last accessed
AclBit | Boolean | Flag indicating whether the file has ACLs set on it
Block Size | Long | Block size reported by the server
Expiry Time | Date | Date and time at which the file expires, as UTC time
ReplicationFactor | Int | Replication factor reported by the server
isContainer | Boolean | "true" if the item is a directory, "false" if it is a file
Fetch Url | String | Azure Data Lake full absolute path including the FQDN, e.g. adl://[yourdomain].azuredatalakestore.net/full/path/to.file
Last Modified Date | Date | Date and time the file was last modified
Acls | ACL Array | List of access entries for the file or folder
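As an illustration of how a raw directory entry maps onto the fields above, here is a minimal Python sketch. The shape and key names of `raw_entry` are assumptions modeled on what a Data Lake listing typically returns; they are not part of the connector itself.

```python
import os

# Hypothetical raw directory entry; the key names and values are assumptions
# for illustration only, not output captured from the connector.
raw_entry = {
    "name": "test/test6.txt",
    "length": 0,
    "group": "3b891abc-b0d4-4c57-8231-b5b48ff8f912",
    "owner": "3b891abc-b0d4-4c57-8231-b5b48ff8f912",
    "permission": "770",
    "aclBit": True,
    "blockSize": 268435456,
    "replication": 1,
    "type": "FILE",
}

def to_connector_fields(entry, domain="dlsjose.azuredatalakestore.net"):
    """Map an assumed raw entry onto the connector field names in the table."""
    fullname = "/" + entry["name"].lstrip("/")
    return {
        "fullname": fullname,
        "name": os.path.basename(fullname),       # file name minus the path
        "length": entry["length"],
        "group": entry["group"],
        "user": entry["owner"],
        "permission": entry["permission"],
        "aclBit": entry["aclBit"],
        "blocksize": entry["blockSize"],
        "replicationFactor": entry["replication"],
        "isContainer": entry["type"] == "DIRECTORY",
        "fetchUrl": f"adl://{domain}{fullname}",   # absolute path incl. FQDN
    }

fields = to_connector_fields(raw_entry)
print(fields["fetchUrl"])  # adl://dlsjose.azuredatalakestore.net/test/test6.txt
```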


The following code block shows example console output from crawling a folder called /test located at the root of the test Data Lake Storage account adl://dlsjose.azuredatalakestore.net


Code Block
languagebash
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Received job - action: start
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing crawl: 1528133426127
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing statistics for crawl: 1528133426127
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls - please wait...
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls took 200 ms
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Offering crawl root
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Sending start job for crawl: 1528133426127 (status: I)
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status checker thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status checker thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl start job
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: [/test]
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager/ProcessCrawlRoot]: Added root item: /test
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test
2018-06-04T17:30:29Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test
2018-06-04T17:30:30Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/NOACCESS
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test4.txt
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/subtest
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test5.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test6.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test scanned 5 subitems
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test4.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test5.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test6.txt
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status thread stopped
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status thread stopped
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl end job
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Crawl ended with status: S
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread stopped
2018-06-04T17:30:35Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread stopped
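The "Crawl ended with status:" line is what signals completion. A minimal Python sketch (not part of the connector) for pulling that status out of console output of this shape:

```python
# Extract the final crawl status from console output in the format shown
# above. The two-line sample log here is abbreviated for illustration.
sample_log = """\
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Received job - action: start
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Crawl ended with status: S
"""

def crawl_status(log_text):
    """Return the status code from the 'Crawl ended with status:' line, if any."""
    for line in log_text.splitlines():
        if "Crawl ended with status:" in line:
            # Split on the last colon; everything after it is the status code.
            return line.rsplit(":", 1)[1].strip()
    return None

print(crawl_status(sample_log))  # S
```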

Example output of the metadata found for a file:

Code Block
languagexml
<job id="192.168.56.1:50505/2018-06-01T15:56:21Z/1/18" time="2018-06-01T17:02:45Z">
<doc>
  <qid>/test/test6.txt</qid>
  <id>/test/test6.txt</id>
  <connectorSpecific>
    <field name="fullname">/test/test6.txt</field>
    <field name="length">0</field>
    <field name="group">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
    <field name="user">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
    <field name="permission">770</field>
    <field name="lastAccessTime">Mon May 28 15:31:16 CST 2018</field>
    <field name="aclBit">true</field>
    <field name="blocksize">268435456</field>
    <field name="expiryTime"/>
    <field name="replicationFactor">1</field>
    <field name="isContainer">false</field>
  </connectorSpecific>
  <fetchUrl>/test/test6.txt</fetchUrl>
  <url>/test/test6.txt</url>
  <lastModified>Mon May 28 15:31:16 CST 2018</lastModified>
  <acls>
    <acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xxx.azuredatalakestore.net\41599999-13e0-4431-9b35-d2da6e9ccee8" name="user:41599999-13e0-4431-9b35-d2da6e9ccee8:rwx" scope="global"/>
    <acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xx.azuredatalakestore.net\" name="group::rwx" scope="global"/>
  </acls>
  <displayUrl>/test/test6.txt</displayUrl>
  <action>add</action>
  <docType>item</docType>
  <sourceName>aspire-azuredatalakestore</sourceName>
  <sourceType>azureDataLakeStore</sourceType>
  <sourceId>aspire_azuredatalakestore</sourceId>
  <repItemType>aspire/file</repItemType>
  <hierarchy>
    <item id="2FED9DB88E9569860C5F71054971EC21" level="3" name="test6.txt" url="/test/test6.txt">
      <ancestors>
        <ancestor id="4539330648B80F94EF3BF911F6D77AC9" level="2" name="test" parent="true" type="aspire/folder" url="/test"/>
        <ancestor id="6666CD76F96956469E7BE39D750CC7D9" level="1" type="aspire/folder" url="/"/>
      </ancestors>
    </item>
  </hierarchy>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=UTF-8</contentType>
  <extension source="ExtractTextStage">
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="Content-Encoding">UTF-8</field>
    <field name="resourceName">/test/test6.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[
]]></content>
  <contentLength source="ExtractTextStage">1</contentLength>
</doc>
</job>
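A metadata document like the one above can be consumed with the Python standard library alone. The sketch below parses an abbreviated version of the job XML (trimmed to two fields and one ACL for brevity) and collects the connectorSpecific fields and ACL names:

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the metadata job document shown above.
job_xml = """\
<job id="1" time="2018-06-01T17:02:45Z">
  <doc>
    <connectorSpecific>
      <field name="fullname">/test/test6.txt</field>
      <field name="isContainer">false</field>
    </connectorSpecific>
    <acls>
      <acl access="allow" entity="user" name="user:41599999:rwx" scope="global"/>
    </acls>
  </doc>
</job>"""

root = ET.fromstring(job_xml)

# connectorSpecific fields are <field name="...">value</field> elements.
fields = {f.get("name"): f.text for f in root.findall(".//connectorSpecific/field")}

# Each <acl> carries its access entry in the "name" attribute.
acl_names = [a.get("name") for a in root.findall(".//acls/acl")]

print(fields["fullname"])  # /test/test6.txt
print(acl_names)           # ['user:41599999:rwx']
```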


If any other component is required, add it after all these sections.