Azure Data Lake Source 

Detailed information on configuring the Azure Data Lake Source. Also includes a discussion of all of the metadata produced by the Source, with an example.

Azure Data Lake Fetch URL 

Detailed information on the Azure Data Lake FetchUrl Java component, which opens an InputStream to the given URL that can then be read by downstream pipeline stages (see the illustrative sketch below). Only needed by programmers who need to integrate the Azure Data Lake scanner in novel ways not supported by the standard framework and routing table.
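
The FetchUrl component itself is part of the Aspire framework; the sketch below only illustrates the general "open an InputStream to a URL and let downstream stages read it" pattern described above. The class, method, and URL here are hypothetical stand-ins, not the actual Aspire API.

Code Block (java)
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Illustrative only: demonstrates the stream-reading pattern described above.
// openStream() is a hypothetical stand-in for whatever the FetchUrl component
// actually returns; in the real connector the stream comes from Azure Data Lake.
public class FetchUrlSketch {

    static InputStream openStream(String url) throws IOException {
        return new java.net.URL(url).openStream();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical URL; the connector would supply the item's fetch URL.
        try (InputStream in = openStream("https://example.com/test/test6.txt");
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // A downstream pipeline stage (e.g. text extraction) would read the content here.
            reader.lines().forEach(System.out::println);
        }
    }
}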

Azure Data Lake Group Expansion 

...

The connector will crawl folders and files (depending on the configuration) and will pull the following metadata; a sketch of how these fields might be carried in code follows the list.

    • Fullname
    • Length
    • Group
    • User
    • Permission
    • Last Access Time
    • ACL Bit
    • Block Size
    • Expiry Time
    • Replication Factor
    • Is Container
    • Fetch URL
    • Last Modified Date
    • ACLs
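
As an illustration only, the hypothetical Java data holder below shows how these fields might be carried through a pipeline; the field names mirror the connectorSpecific output shown further down, but this class is not part of the connector.

Code Block (java)
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the metadata fields listed above; not an Aspire class.
public class AzureDataLakeItemMetadata {
    public String fullname;           // e.g. "/test/test6.txt"
    public long length;               // size in bytes
    public String group;              // owning group (object id)
    public String user;               // owning user (object id)
    public String permission;         // octal permission string, e.g. "770"
    public String lastAccessTime;
    public boolean aclBit;            // true when extended ACL entries exist
    public long blockSize;            // e.g. 268435456
    public String expiryTime;         // may be empty
    public int replicationFactor;
    public boolean isContainer;       // true for folders, false for files
    public String fetchUrl;
    public String lastModified;
    public List<String> acls = new ArrayList<>();
}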

Here is an example of a crawl run on the folder "/test":


Code Block (bash)
Felix> 2018-06-01T17:18:12Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Received job - action: start
2018-06-01T17:18:12Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing crawl: 1527873492972
2018-06-01T17:18:12Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing statistics for crawl: 1527873492972
2018-06-01T17:18:12Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls - please wait...
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Sending start job for crawl: 1527873492972 (status: INI)
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status checker thread started
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread started
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread started
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status checker thread started
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls took 200 ms
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Offering crawl root
2018-06-01T17:18:14Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl start job
2018-06-01T17:18:14Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: [/test]
2018-06-01T17:18:15Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test
2018-06-01T17:18:15Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager/ProcessCrawlRoot]: Added root item: /test
2018-06-01T17:18:16Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/NOACCESS
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/test4.txt
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/NOACCESS
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/test4.txt
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/test5.txt
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/test6.txt
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test scanned 5 subitems
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test/NOACCESS
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test/subtest
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/test5.txt
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/test6.txt
2018-06-01T17:18:19Z WARN [/aspire_azuredatalakestore/RAP]: Unable to access path: '/test/NOACCESS'. Missing READ and EXECUTE access. Please check your application created. Skipped
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test/NOACCESS
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test/NOACCESS scanned 0 subitems
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test/subtest
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/sub-sub-test
2018-06-01T17:18:20Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/sub-sub-test
2018-06-01T17:18:20Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/test1.txt
2018-06-01T17:18:20Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/test7.txt
2018-06-01T17:18:20Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test/subtest scanned 3 subitems
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test/subtest/sub-sub-test
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/test1.txt
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/test7.txt
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test/subtest/sub-sub-test
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/sub-sub-test/test2.txt
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/sub-sub-test/test8.txt
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test/subtest/sub-sub-test scanned 2 subitems
2018-06-01T17:18:22Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/sub-sub-test/test2.txt
2018-06-01T17:18:22Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/sub-sub-test/test8.txt
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl end job
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status thread stopped
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status thread stopped
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Crawl ended with status: S
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread stopped
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread stopped


Output of the metadata found for a file:


Code Block (xml)
<job id="192.168.56.1:50505/2018-06-01T15:56:21Z/1/18" time="2018-06-01T17:02:45Z">
<doc>
  <qid>/test/test6.txt</qid>
  <id>/test/test6.txt</id>
  <connectorSpecific>
    <field name="fullname">/test/test6.txt</field>
    <field name="length">0</field>
    <field name="group">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
    <field name="user">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
    <field name="permission">770</field>
    <field name="lastAccessTime">Mon May 28 15:31:16 CST 2018</field>
    <field name="aclBit">true</field>
    <field name="blocksize">268435456</field>
    <field name="expiryTime"/>
    <field name="replicationFactor">1</field>
    <field name="isContainer">false</field>
  </connectorSpecific>
  <fetchUrl>/test/test6.txt</fetchUrl>
  <url>/test/test6.txt</url>
  <lastModified>Mon May 28 15:31:16 CST 2018</lastModified>
  <acls>
    <acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xxx.azuredatalakestore.net\41599999-13e0-4431-9b35-d2da6e9ccee8" name="user:41599999-13e0-4431-9b35-d2da6e9ccee8:rwx" scope="global"/>
    <acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xx.azuredatalakestore.net\" name="group::rwx" scope="global"/>
  </acls>
  <displayUrl>/test/test6.txt</displayUrl>
  <action>add</action>
  <docType>item</docType>
  <sourceName>aspire-azuredatalakestore</sourceName>
  <sourceType>azureDataLakeStore</sourceType>
  <sourceId>aspire_azuredatalakestore</sourceId>
  <repItemType>aspire/file</repItemType>
  <hierarchy>
    <item id="2FED9DB88E9569860C5F71054971EC21" level="3" name="test6.txt" url="/test/test6.txt">
      <ancestors>
        <ancestor id="4539330648B80F94EF3BF911F6D77AC9" level="2" name="test" parent="true" type="aspire/folder" url="/test"/>
        <ancestor id="6666CD76F96956469E7BE39D750CC7D9" level="1" type="aspire/folder" url="/"/>
      </ancestors>
    </item>
  </hierarchy>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=UTF-8</contentType>
  <extension source="ExtractTextStage">
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="Content-Encoding">UTF-8</field>
    <field name="resourceName">/test/test6.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[
]]></content>
  <contentLength source="ExtractTextStage">1</contentLength>
</doc>
</job>
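
As a rough sketch of how a downstream consumer might read the connectorSpecific fields out of a job document like the one above, the example below uses the standard javax.xml DOM API; the local file name is an assumption and the class is illustrative, not part of the connector.

Code Block (java)
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative only: parses a saved copy of the job XML shown above and
// prints every field under <connectorSpecific> as a name/value pair.
public class ReadConnectorSpecificFields {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("job.xml"));   // hypothetical local copy of the output above

        NodeList fields = doc.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            // Skip <field> elements that belong to <extension> or other parents.
            if ("connectorSpecific".equals(field.getParentNode().getNodeName())) {
                System.out.println(field.getAttribute("name") + " = " + field.getTextContent());
            }
        }
    }
}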


Any other components should be added after all of these sections.