The connector crawls folders and files (configuration dependent). Execution results in the following fields being populated:
Property | Type | Description |
---|---|---|
Fullname | String | Full path of the directory or file (from root path "/") |
Name | String | Name of the directory or file, without the path |
Length | Long | Length of a file (does not apply to directories) |
Group | String | ID of the group that owns this file/directory |
User | String | ID of the user that owns this file/directory |
Permission | String | Unix-style permission string for this file or directory |
Last Access Time | Date | Date and time the file was last accessed |
AclBit | Boolean | Flag indicating whether file has ACLs set on it |
Block Size | Long | Block size reported by server |
Expiry Time | Date | Date and time at which the file expires, in UTC |
ReplicationFactor | Int | Replication Factor reported by server |
isContainer | Boolean | "true" if the item is a directory, "false" if it is a file |
Fetch Url | String | Full absolute Azure Data Lake path, including the FQDN, for example adl://[yourdomain].azuredatalakestore.net/full/path/to.file |
Last Modified Date | Date | Date and time the file was last modified |
Acls | ACL Array | List of access control entries for the file or folder |
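
The Permission field is described as a Unix-style permission string, and the sample output on this page shows the value 770. As a minimal sketch, assuming the value is always reported as octal digits (this is an assumption based on that single sample, and `describe_permission` is a hypothetical helper, not part of the connector), it can be expanded into the familiar rwx form:

```python
def describe_permission(octal: str) -> str:
    """Expand an octal permission such as "770" into "rwxrwx---".

    Assumption: the connector reports owner/group/other as the last
    three octal digits, as the sample value "770" suggests.
    """
    flags = "rwx"
    parts = []
    for digit in octal[-3:]:
        bits = int(digit, 8)  # raises ValueError for non-octal digits
        # 4 = read, 2 = write, 1 = execute
        parts.append("".join(f if bits & (4 >> i) else "-" for i, f in enumerate(flags)))
    return "".join(parts)


print(describe_permission("770"))  # owner and group have full access, others none
```

Here 770 means the owning user and owning group can read, write and execute, while everyone else has no access.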
The following code block shows example console output from crawling a folder named /test, located at the root of the test Data Lake Store adl://dlsjose.azuredatalakestore.net:
Code Block | ||
---|---|---|
| ||
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Received job - action: start
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing crawl: 1528133426127
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing statistics for crawl: 1528133426127
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls - please wait...
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls took 200 ms
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Offering crawl root
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Sending start job for crawl: 1528133426127 (status: I)
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status checker thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status checker thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl start job
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: [/test]
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager/ProcessCrawlRoot]: Added root item: /test
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test
2018-06-04T17:30:29Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test
2018-06-04T17:30:30Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/NOACCESS
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/subtest
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/NOACCESS
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/subtest
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test4.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test5.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test6.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test scanned 5 subitems
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test4.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test5.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test6.txt
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status thread stopped
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status thread stopped
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl end job
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Crawl ended with status: S
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread stopped
2018-06-04T17:30:35Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread stopped |
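
Console output in this shape (timestamp, level, bracketed component path, message) can be split into its parts with a regular expression, which helps when checking which items a crawl actually processed. The sketch below is illustrative only: `LOG_LINE`, `parse_log_line` and `processed_paths` are hypothetical names, and the pattern assumes every entry follows the layout seen in the example output.

```python
import re

# Assumed layout: <timestamp> <LEVEL> [<component path>]: <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]:\s+(?P<message>.*)$"
)


def parse_log_line(line: str):
    """Return the parts of one log entry as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def processed_paths(lines):
    """Collect the paths reported by 'Processing: <path>' messages."""
    prefix = "Processing: "
    paths = []
    for line in lines:
        entry = parse_log_line(line)
        if entry and entry["message"].startswith(prefix):
            paths.append(entry["message"][len(prefix):])
    return paths


sample = [
    "2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: "
    "Processing: adl://dlsjose.azuredatalakestore.net/test",
    "2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: "
    "Item /test scanned 5 subitems",
    "2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: "
    "Crawl ended with status: S",
]

for path in processed_paths(sample):
    print(path)
```

Only entries whose message starts with "Processing: " are collected, so the ">>> Processing crawl" lines written by the RAP component are not counted twice.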
Example of the metadata output produced for a file:
Code Block | ||
---|---|---|
| ||
<job id="192.168.56.1:50505/2018-06-01T15:56:21Z/1/18" time="2018-06-01T17:02:45Z">
<doc>
<qid>/test/test6.txt</qid>
<id>/test/test6.txt</id>
<connectorSpecific>
<field name="fullname">/test/test6.txt</field>
<field name="length">0</field>
<field name="group">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
<field name="user">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
<field name="permission">770</field>
<field name="lastAccessTime">Mon May 28 15:31:16 CST 2018</field>
<field name="aclBit">true</field>
<field name="blocksize">268435456</field>
<field name="expiryTime"/>
<field name="replicationFactor">1</field>
<field name="isContainer">false</field>
</connectorSpecific>
<fetchUrl>/test/test6.txt</fetchUrl>
<url>/test/test6.txt</url>
<lastModified>Mon May 28 15:31:16 CST 2018</lastModified>
<acls>
<acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xxx.azuredatalakestore.net\41599999-13e0-4431-9b35-d2da6e9ccee8" name="user:41599999-13e0-4431-9b35-d2da6e9ccee8:rwx" scope="global"/>
<acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xx.azuredatalakestore.net\" name="group::rwx" scope="global"/>
</acls>
<displayUrl>/test/test6.txt</displayUrl>
<action>add</action>
<docType>item</docType>
<sourceName>aspire-azuredatalakestore</sourceName>
<sourceType>azureDataLakeStore</sourceType>
<sourceId>aspire_azuredatalakestore</sourceId>
<repItemType>aspire/file</repItemType>
<hierarchy>
<item id="2FED9DB88E9569860C5F71054971EC21" level="3" name="test6.txt" url="/test/test6.txt">
<ancestors>
<ancestor id="4539330648B80F94EF3BF911F6D77AC9" level="2" name="test" parent="true" type="aspire/folder" url="/test"/>
<ancestor id="6666CD76F96956469E7BE39D750CC7D9" level="1" type="aspire/folder" url="/"/>
</ancestors>
</item>
</hierarchy>
<contentType source="ExtractTextStage/Content-Type">text/plain; charset=UTF-8</contentType>
<extension source="ExtractTextStage">
<field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
<field name="Content-Encoding">UTF-8</field>
<field name="resourceName">/test/test6.txt</field>
</extension>
<content source="ExtractTextStage"><![CDATA[
]]></content>
<contentLength source="ExtractTextStage">1</contentLength>
</doc>
</job> |
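
To relate this output back to the field table, the document can be read with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`; `SAMPLE` is a trimmed copy of the output above (several elements and the `fullname` attributes are omitted for brevity), and `read_doc` is a hypothetical helper, not part of the connector.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example job document; not the complete output.
SAMPLE = """<job id="192.168.56.1:50505/2018-06-01T15:56:21Z/1/18" time="2018-06-01T17:02:45Z">
<doc>
<id>/test/test6.txt</id>
<connectorSpecific>
<field name="fullname">/test/test6.txt</field>
<field name="length">0</field>
<field name="permission">770</field>
<field name="expiryTime"/>
<field name="isContainer">false</field>
</connectorSpecific>
<acls>
<acl access="allow" entity="user" name="user:41599999-13e0-4431-9b35-d2da6e9ccee8:rwx" scope="global"/>
<acl access="allow" entity="user" name="group::rwx" scope="global"/>
</acls>
</doc>
</job>"""


def read_doc(job_xml: str) -> dict:
    """Extract the id, connector-specific fields and ACL names from a job document."""
    doc = ET.fromstring(job_xml).find("doc")
    # Empty elements such as <field name="expiryTime"/> become empty strings.
    fields = {f.get("name"): (f.text or "") for f in doc.find("connectorSpecific")}
    acls = [acl.get("name") for acl in doc.iter("acl")]
    return {"id": doc.findtext("id"), "fields": fields, "acls": acls}


result = read_doc(SAMPLE)
print(result["id"], result["fields"]["permission"], result["acls"])
```

Field values arrive as strings, so callers would still need to convert types such as Long or Boolean themselves.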