The Azure Data Lake connector crawls files and folders (configuration-dependent). Each crawled item is populated with the following fields:
Property | Type | Description |
---|---|---|
Fullname | String | Full path of the directory or file (from root path "/") |
Name | String | File name (minus the path) of the directory or file |
Length | Long | Length of a file (does not apply to directories) |
Group | String | ID of the group that owns this file/directory |
User | String | ID of the user that owns this file/directory |
Permission | String | Unix-style permission string for this file or directory |
Last Access Time | Date | Date and time when the file was last accessed |
AclBit | Boolean | Flag indicating if the file has ACLs set on it |
Block Size | Long | Block size reported by server |
Expiry Time | Date | Date and time when the file expires, as UTC time |
ReplicationFactor | Int | Replication factor reported by server |
isContainer | Boolean | "true" if the item is a directory, "false" if it is a file |
Fetch Url | String | Full absolute Azure Data Lake path, including the FQDN, for example adl://[yourdomain].azuredatalakestore.net/full/path/to.file |
Last Modified Date | Date | Date and time when the file was last modified |
Acls | ACL Array | List of access control entries for the file or folder |
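
The Permission field carries a Unix-style permission string; the sample file output on this page shows the value 770. As a minimal sketch, assuming the value is always a three- or four-digit octal string, the following Python helper converts it to the symbolic form. The function name `octal_to_symbolic` is hypothetical and not part of the connector.

```python
def octal_to_symbolic(permission: str) -> str:
    """Convert an octal permission string such as "770" to "rwxrwx---".

    Assumption: the connector reports the permission as 3 or 4 octal digits.
    """
    if len(permission) not in (3, 4) or any(c not in "01234567" for c in permission):
        raise ValueError(f"not an octal permission string: {permission!r}")
    # A fourth, leading digit would hold special bits; only the last three
    # digits (owner, group, other) are decoded here.
    symbolic = ""
    for digit in permission[-3:]:
        bits = int(digit)
        symbolic += "r" if bits & 4 else "-"
        symbolic += "w" if bits & 2 else "-"
        symbolic += "x" if bits & 1 else "-"
    return symbolic


print(octal_to_symbolic("770"))  # rwxrwx---
```

With 770, the owner and the owning group have read, write, and execute access, and all other users have none.
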
The following code block shows the console output of crawling a folder called /test, located at the root of the test Data Lake Store adl://dlsjose.azuredatalakestore.net:
Code Block | ||
---|---|---|
| ||
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Received job - action: start
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing crawl: 1528133426127
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Initializing statistics for crawl: 1528133426127
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls - please wait...
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Clearing queues, snapshot, hierarchy and intersection acls took 200 ms
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Offering crawl root
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Sending start job for crawl: 1528133426127 (status: I)
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status checker thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status checker thread started
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl start job
2018-06-04T17:30:26Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: [/test]
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager/ProcessCrawlRoot]: Added root item: /test
2018-06-04T17:30:28Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test
2018-06-04T17:30:29Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Scanning: /test
2018-06-04T17:30:30Z INFO [/aspire_azuredatalakestore/RAP]: >>> Scan Item - Azure DataLake Store: /test
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/NOACCESS
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/subtest
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/NOACCESS
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/subtest
2018-06-04T17:30:31Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test4.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test5.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/RAP]: >>> Processing crawl - Azure DataLake Store: adl://dlsjose.azuredatalakestore.net/test/test6.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ScanPipelineManager/Scan]: Item /test scanned 5 subitems
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test4.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test5.txt
2018-06-04T17:30:32Z INFO [/aspire_azuredatalakestore/ProcessPipelineManager]: Processing: adl://dlsjose.azuredatalakestore.net/test/test6.txt
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) crawl status thread stopped
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) crawl status thread stopped
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Published crawl end job
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/Main/CrawlController]: Crawl ended with status: S
2018-06-04T17:30:34Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ProcessQueueLoader]: QueueLoader (process) item claim thread stopped
2018-06-04T17:30:35Z INFO [/aspire_azuredatalakestore/QueuePipelineManager/ScanQueueLoader]: QueueLoader (scan) item claim thread stopped |
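
Each entry in the console output follows the same shape: a UTC timestamp, a log level, a component path in square brackets, and a message. As a rough sketch for anyone who wants to filter or summarize such output, the Python snippet below splits one entry into those four parts. The regular expression is inferred only from the sample above and may not cover every line the connector can emit; the names `LogEntry` and `parse_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample output: timestamp, level, [component]: message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]:\s*"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    component: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the parsed entry, or None if the line does not match the pattern."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogEntry(**match.groupdict())


sample = (
    "2018-06-04T17:30:32Z INFO "
    "[/aspire_azuredatalakestore/ScanPipelineManager/Scan]: "
    "Item /test scanned 5 subitems"
)
entry = parse_line(sample)
print(entry.component, "->", entry.message)
```

Returning `None` for non-matching lines lets a caller skip anything that is not a single-line log entry without raising an error.
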
The following code block shows the metadata output for a single file:
Code Block | ||
---|---|---|
| ||
<job id="192.168.56.1:50505/2018-06-01T15:56:21Z/1/18" time="2018-06-01T17:02:45Z">
<doc>
<qid>/test/test6.txt</qid>
<id>/test/test6.txt</id>
<connectorSpecific>
<field name="fullname">/test/test6.txt</field>
<field name="length">0</field>
<field name="group">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
<field name="user">3b891abc-b0d4-4c57-8231-b5b48ff8f912</field>
<field name="permission">770</field>
<field name="lastAccessTime">Mon May 28 15:31:16 CST 2018</field>
<field name="aclBit">true</field>
<field name="blocksize">268435456</field>
<field name="expiryTime"/>
<field name="replicationFactor">1</field>
<field name="isContainer">false</field>
</connectorSpecific>
<fetchUrl>/test/test6.txt</fetchUrl>
<url>/test/test6.txt</url>
<lastModified>Mon May 28 15:31:16 CST 2018</lastModified>
<acls>
<acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xxx.azuredatalakestore.net\41599999-13e0-4431-9b35-d2da6e9ccee8" name="user:41599999-13e0-4431-9b35-d2da6e9ccee8:rwx" scope="global"/>
<acl access="allow" domain="xxx.azuredatalakestore.net" entity="user" fullname="xx.azuredatalakestore.net\" name="group::rwx" scope="global"/>
</acls>
<displayUrl>/test/test6.txt</displayUrl>
<action>add</action>
<docType>item</docType>
<sourceName>aspire-azuredatalakestore</sourceName>
<sourceType>azureDataLakeStore</sourceType>
<sourceId>aspire_azuredatalakestore</sourceId>
<repItemType>aspire/file</repItemType>
<hierarchy>
<item id="2FED9DB88E9569860C5F71054971EC21" level="3" name="test6.txt" url="/test/test6.txt">
<ancestors>
<ancestor id="4539330648B80F94EF3BF911F6D77AC9" level="2" name="test" parent="true" type="aspire/folder" url="/test"/>
<ancestor id="6666CD76F96956469E7BE39D750CC7D9" level="1" type="aspire/folder" url="/"/>
</ancestors>
</item>
</hierarchy>
<contentType source="ExtractTextStage/Content-Type">text/plain; charset=UTF-8</contentType>
<extension source="ExtractTextStage">
<field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
<field name="Content-Encoding">UTF-8</field>
<field name="resourceName">/test/test6.txt</field>
</extension>
<content source="ExtractTextStage"><![CDATA[
]]></content>
<contentLength source="ExtractTextStage">1</contentLength>
</doc>
</job> |
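
The connector-specific metadata in this output is a flat list of `<field name="...">` elements, and the ACLs are `<acl>` elements with attributes. As an illustrative sketch, the Python snippet below reads both from a trimmed copy of the job document above using the standard library. The helper names `connector_fields` and `acl_names` are hypothetical, and the trimmed sample omits most elements and attributes for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the job document shown above (most elements omitted).
SAMPLE = """
<job>
  <doc>
    <id>/test/test6.txt</id>
    <connectorSpecific>
      <field name="fullname">/test/test6.txt</field>
      <field name="length">0</field>
      <field name="permission">770</field>
      <field name="aclBit">true</field>
      <field name="expiryTime"/>
      <field name="isContainer">false</field>
    </connectorSpecific>
    <acls>
      <acl access="allow" entity="user" name="user:41599999-13e0-4431-9b35-d2da6e9ccee8:rwx" scope="global"/>
      <acl access="allow" entity="user" name="group::rwx" scope="global"/>
    </acls>
  </doc>
</job>
"""


def connector_fields(job_xml: str) -> dict:
    """Map each connectorSpecific field name to its text (empty string if absent)."""
    root = ET.fromstring(job_xml.strip())
    return {
        field.get("name"): field.text or ""
        for field in root.iterfind("./doc/connectorSpecific/field")
    }


def acl_names(job_xml: str) -> list:
    """Return the name attribute of every acl element, in document order."""
    root = ET.fromstring(job_xml.strip())
    return [acl.get("name") for acl in root.iterfind("./doc/acls/acl")]


fields = connector_fields(SAMPLE)
print(fields["fullname"], fields["permission"], fields["isContainer"])
print(acl_names(SAMPLE))
```

All field values arrive as strings, so a consumer would still need to convert values such as `length` or `aclBit` to the types listed in the table at the top of this page.
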