
Azure Data Lake Source 

The connector will crawl folders and files (depending on the configuration) and will pull the following metadata:

    • Full Name
    • Length
    • Group
    • User
    • Permission
    • Last Access Time
    • ACL Bit
    • Block Size
    • Expiry Time
    • Replication Factor
    • Is Container
    • Fetch URL
    • Last Modified Date
    • ACLs
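The Permission value is reported as a POSIX-style octal string (for example "770" in the sample output further down). As a small illustrative sketch, independent of the connector itself, it can be decoded into the familiar rwx notation, one triad each for user, group, and other:

```python
def octal_to_rwx(permission: str) -> str:
    """Decode an octal permission string (e.g. "770") into rwx notation."""
    bits = "rwx"
    out = []
    for digit in permission:
        n = int(digit, 8)
        # Test the read (4), write (2), and execute (1) bits of this digit
        out.append("".join(bits[i] if n & (4 >> i) else "-" for i in range(3)))
    return "".join(out)

print(octal_to_rwx("770"))  # rwxrwx---
```

So the "770" seen below means the owning user and group have full read/write/execute access, while others have none.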

Here is a run example for a crawl on the folder "/test":


2018-06-01T17:18:12Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Received job - action: start
2018-06-01T17:18:12Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing crawl: 1527873492972
2018-06-01T17:18:12Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing statistics for crawl: 1527873492972
2018-06-01T17:18:12Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls - please wait...
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Sending start job for crawl: 1527873492972 (status: INI)
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status checker thread started
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread started
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread started
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status checker thread started
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls took 200 ms
2018-06-01T17:18:13Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Offering crawl root
2018-06-01T17:18:14Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl start job
2018-06-01T17:18:14Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: [/test]
2018-06-01T17:18:15Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test
2018-06-01T17:18:15Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager/ProcessCrawlRoot]: Added root item: /test
2018-06-01T17:18:16Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/NOACCESS
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest
2018-06-01T17:18:17Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/test4.txt
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/NOACCESS
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/test4.txt
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/test5.txt
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/test6.txt
2018-06-01T17:18:18Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test scanned 5 subitems
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test/NOACCESS
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test/subtest
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/test5.txt
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/test6.txt
2018-06-01T17:18:19Z WARN [/aspire_azuredatalakestore/RAP]: Unable to access path: '/test/NOACCESS'. Missing READ and EXECUTE access. Please check your application created. Skipped
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test/NOACCESS
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test/NOACCESS scanned 0 subitems
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test/subtest
2018-06-01T17:18:19Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/sub-sub-test
2018-06-01T17:18:20Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/sub-sub-test
2018-06-01T17:18:20Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/test1.txt
2018-06-01T17:18:20Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/test7.txt
2018-06-01T17:18:20Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test/subtest scanned 3 subitems
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test/subtest/sub-sub-test
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/test1.txt
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/test7.txt
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test/subtest/sub-sub-test
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/sub-sub-test/test2.txt
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: /test/subtest/sub-sub-test/test8.txt
2018-06-01T17:18:21Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test/subtest/sub-sub-test scanned 2 subitems
2018-06-01T17:18:22Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/sub-sub-test/test2.txt
2018-06-01T17:18:22Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: /test/subtest/sub-sub-test/test8.txt
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl end job
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status thread stopped
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status thread stopped
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Crawl ended with status: S
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread stopped
2018-06-01T17:18:23Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread stopped


Example output of the metadata found for a file:


<job id="192.168.56.1:50505/2018-06-01T15:56:21Z/1/18" time="2018-06-01T17:02:45Z">
<doc>
  <qid>/test/test6.txt</qid>
  <id>/test/test6.txt</id>
  <connectorSpecific>
    <field name="fullname">/test/test6.txt</field>
    <field name="length">0</field>
    <field name="group">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
    <field name="user">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
    <field name="permission">770</field>
    <field name="lastAccessTime">Mon May 28 15:31:16 CST 2018</field>
    <field name="aclBit">true</field>
    <field name="blocksize">268435456</field>
    <field name="expiryTime"/>
    <field name="replicationFactor">1</field>
    <field name="isContainer">false</field>
  </connectorSpecific>
  <fetchUrl>/test/test6.txt</fetchUrl>
  <url>/test/test6.txt</url>
  <lastModified>Mon May 28 15:31:16 CST 2018</lastModified>
  <acls>
    <acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xxx.azuredatalakestore.net\41599999-13e0-4431-9b35-d2da6e9ccee8" name="user:41599999-13e0-4431-9b35-d2da6e9ccee8:rwx" scope="global"/>
    <acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xx.azuredatalakestore.net\" name="group::rwx" scope="global"/>
  </acls>
  <displayUrl>/test/test6.txt</displayUrl>
  <action>add</action>
  <docType>item</docType>
  <sourceName>aspire-azuredatalakestore</sourceName>
  <sourceType>azureDataLakeStore</sourceType>
  <sourceId>aspire_azuredatalakestore</sourceId>
  <repItemType>aspire/file</repItemType>
  <hierarchy>
    <item id="2FED9DB88E9569860C5F71054971EC21" level="3" name="test6.txt" url="/test/test6.txt">
      <ancestors>
        <ancestor id="4539330648B80F94EF3BF911F6D77AC9" level="2" name="test" parent="true" type="aspire/folder" url="/test"/>
        <ancestor id="6666CD76F96956469E7BE39D750CC7D9" level="1" type="aspire/folder" url="/"/>
      </ancestors>
    </item>
  </hierarchy>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=UTF-8</contentType>
  <extension source="ExtractTextStage">
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="Content-Encoding">UTF-8</field>
    <field name="resourceName">/test/test6.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[
]]></content>
  <contentLength source="ExtractTextStage">1</contentLength>
</doc>
</job>
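The connector-specific metadata in the job document above is a flat list of named `<field>` elements, so a downstream consumer can collect them into a simple lookup. A minimal sketch using only Python's standard library (the sample below is a trimmed copy of the document above, not connector API):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the job document shown above, for illustration only
SAMPLE = """\
<doc>
  <connectorSpecific>
    <field name="fullname">/test/test6.txt</field>
    <field name="permission">770</field>
    <field name="isContainer">false</field>
  </connectorSpecific>
  <fetchUrl>/test/test6.txt</fetchUrl>
</doc>"""

doc = ET.fromstring(SAMPLE)

# Collect every <field> under <connectorSpecific> into a dict keyed by name
fields = {f.get("name"): f.text for f in doc.findall("./connectorSpecific/field")}

print(fields["fullname"])       # /test/test6.txt
print(fields["permission"])     # 770
print(doc.findtext("fetchUrl")) # /test/test6.txt
```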


If any other components are added to the workflow, their output is appended after all of these sections.
