Some of the features of the HDFS Binary Writer and the HDFS Archive Compactor include:
As mentioned, the binary writer writes data to files under a base directory on HDFS. Under that base directory, a directory is created for the “content source”. The “content source” name can be extracted from the Aspire job (from a user-defined location in the Aspire document) or set to a static value (allowing data from multiple Aspire connectors to be written to the same “content source” directory).
The id of the document is then used to determine the rest of the path to the stored file and its name. An MD5 hash of the document id is taken. The file name is constructed from that hash (to ensure uniqueness) and a “name” taken from the document id. This “name” is the file name if the id was a file path, or the “page name” if the id was a URL. If the path to the HDFS file would exceed the HDFS limit of 255 characters, the “name” is shortened by removing a section from its middle. This approach ensures that the file name is unique while keeping a “recognisable” part that helps the user locate the file.
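The naming scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the writer's actual implementation: the helper names (`recognisable_name`, `shorten_middle`, `stored_file_name`) are hypothetical, the hash is assumed to be rendered as upper-case hex and joined to the name with a hyphen (as in the example path on this page), and the 255-character limit is applied to the file name alone for simplicity.

```python
import hashlib
import posixpath
from urllib.parse import urlparse

MAX_LEN = 255  # limit quoted in the text; applied to the file name only (assumption)


def recognisable_name(doc_id: str) -> str:
    """Last segment of a file path, or of the path part of a URL (assumed rule)."""
    path = urlparse(doc_id).path if "://" in doc_id else doc_id
    path = path.replace("\\", "/").rstrip("/")
    return posixpath.basename(path)


def shorten_middle(name: str, max_len: int) -> str:
    """Cut a section out of the middle so that at most max_len characters remain."""
    if len(name) <= max_len:
        return name
    if max_len <= 0:
        return ""
    head = (max_len + 1) // 2   # characters kept from the start
    tail = max_len - head       # characters kept from the end
    return name[:head] + (name[-tail:] if tail else "")


def stored_file_name(doc_id: str, max_len: int = MAX_LEN) -> str:
    """Build '<MD5 of id>-<name>', shortening the name part if needed."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest().upper()
    budget = max_len - len(digest) - 1  # room left after the hash and the hyphen
    return f"{digest}-{shorten_middle(recognisable_name(doc_id), budget)}"


print(stored_file_name("/data/docs/0.txt"))
print(stored_file_name("http://example.com/site/page.html?x=1"))
```

Removing characters from the middle keeps both the start of the name and its extension, which is usually what a user scans for when looking for a file.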
A configurable number of the least significant bits of the hash are used to generate two levels of directories in which the file will be written. This distributes the files across the directories. For example:
A3/A3BC/AF0327674491B77556554227D915A3BC-0.txt
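As a rough sketch of how the two directory levels could be derived from the hash, the function below reproduces the A3/A3BC example path. The function name `directory_path` and its parameters are hypothetical. The interpretation is inferred from the example and the table on this page rather than from the writer's code: the lower-level directory is assumed to be named with all of the selected hash bits, and the top-level directory with the bits that remain after the "dir bits" are removed.

```python
def directory_path(hash_hex: str, hash_bits: int = 16, dir_bits: int = 8) -> str:
    """Return '<top level>/<lower level>' for a hex-encoded hash.

    hash_bits: number of least significant hash bits used in total.
    dir_bits:  bits that distinguish the lower-level directories inside
               one top-level directory (inferred meaning, an assumption).
    """
    if hash_bits % 4 or dir_bits % 4 or not 0 < dir_bits < hash_bits:
        raise ValueError("bit counts must be multiples of 4 with 0 < dir_bits < hash_bits")
    low = hash_hex[-(hash_bits // 4):]          # lowest hash_bits bits, as hex digits
    top = low[: (hash_bits - dir_bits) // 4]    # leading digits name the top level
    return f"{top}/{low}"


print(directory_path("AF0327674491B77556554227D915A3BC"))  # prints A3/A3BC
```

Because an MD5 hash is close to uniformly distributed, taking a fixed number of its bits spreads documents roughly evenly over the resulting directories.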
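The table below shows the resulting directory layout for each setting. The last 14 columns give the average number of files per directory for the stated total number of files (m = million, b = billion).
hash bits | dir bits | total directories | top level name length (chars) | lower level name length (chars) | top level directories | lower level directories | 1m | 5m | 10m | 25m | 50m | 100m | 250m | 500m | 1b | 5b | 10b | 25b | 50b | 100b |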
8 | 4 | 256 | 1 | 2 | 16 | 16 | 3,906.25 | 19,531.25 | 39,062.50 | 97,656.25 | 195,312.50 | 390,625.00 | 976,562.50 | 1,953,125.00 | 3,906,250.00 | 19,531,250.00 | 39,062,500.00 | 97,656,250.00 | 195,312,500.00 | 390,625,000.00 |
12 | 8 | 4,096 | 1 | 3 | 16 | 256 | 244.14 | 1,220.70 | 2,441.41 | 6,103.52 | 12,207.03 | 24,414.06 | 61,035.16 | 122,070.31 | 244,140.63 | 1,220,703.13 | 2,441,406.25 | 6,103,515.63 | 12,207,031.25 | 24,414,062.50 |
12 | 4 | 4,096 | 2 | 3 | 256 | 16 | 244.14 | 1,220.70 | 2,441.41 | 6,103.52 | 12,207.03 | 24,414.06 | 61,035.16 | 122,070.31 | 244,140.63 | 1,220,703.13 | 2,441,406.25 | 6,103,515.63 | 12,207,031.25 | 24,414,062.50 |
16 | 12 | 65,536 | 1 | 4 | 16 | 4,096 | 15.26 | 76.29 | 152.59 | 381.47 | 762.94 | 1,525.88 | 3,814.70 | 7,629.39 | 15,258.79 | 76,293.95 | 152,587.89 | 381,469.73 | 762,939.45 | 1,525,878.91 |
16 | 8 | 65,536 | 2 | 4 | 256 | 256 | 15.26 | 76.29 | 152.59 | 381.47 | 762.94 | 1,525.88 | 3,814.70 | 7,629.39 | 15,258.79 | 76,293.95 | 152,587.89 | 381,469.73 | 762,939.45 | 1,525,878.91 |
16 | 4 | 65,536 | 3 | 4 | 4,096 | 16 | 15.26 | 76.29 | 152.59 | 381.47 | 762.94 | 1,525.88 | 3,814.70 | 7,629.39 | 15,258.79 | 76,293.95 | 152,587.89 | 381,469.73 | 762,939.45 | 1,525,878.91 |
20 | 16 | 1,048,576 | 1 | 5 | 16 | 65,536 | 0.95 | 4.77 | 9.54 | 23.84 | 47.68 | 95.37 | 238.42 | 476.84 | 953.67 | 4,768.37 | 9,536.74 | 23,841.86 | 47,683.72 | 95,367.43 |
20 | 12 | 1,048,576 | 2 | 5 | 256 | 4,096 | 0.95 | 4.77 | 9.54 | 23.84 | 47.68 | 95.37 | 238.42 | 476.84 | 953.67 | 4,768.37 | 9,536.74 | 23,841.86 | 47,683.72 | 95,367.43 |
20 | 8 | 1,048,576 | 3 | 5 | 4,096 | 256 | 0.95 | 4.77 | 9.54 | 23.84 | 47.68 | 95.37 | 238.42 | 476.84 | 953.67 | 4,768.37 | 9,536.74 | 23,841.86 | 47,683.72 | 95,367.43 |
20 | 4 | 1,048,576 | 4 | 5 | 65,536 | 16 | 0.95 | 4.77 | 9.54 | 23.84 | 47.68 | 95.37 | 238.42 | 476.84 | 953.67 | 4,768.37 | 9,536.74 | 23,841.86 | 47,683.72 | 95,367.43 |
24 | 20 | 16,777,216 | 1 | 6 | 16 | 1,048,576 | 0.06 | 0.30 | 0.60 | 1.49 | 2.98 | 5.96 | 14.90 | 29.80 | 59.60 | 298.02 | 596.05 | 1,490.12 | 2,980.23 | 5,960.46 |
24 | 16 | 16,777,216 | 2 | 6 | 256 | 65,536 | 0.06 | 0.30 | 0.60 | 1.49 | 2.98 | 5.96 | 14.90 | 29.80 | 59.60 | 298.02 | 596.05 | 1,490.12 | 2,980.23 | 5,960.46 |
24 | 12 | 16,777,216 | 3 | 6 | 4,096 | 4,096 | 0.06 | 0.30 | 0.60 | 1.49 | 2.98 | 5.96 | 14.90 | 29.80 | 59.60 | 298.02 | 596.05 | 1,490.12 | 2,980.23 | 5,960.46 |
24 | 8 | 16,777,216 | 4 | 6 | 65,536 | 256 | 0.06 | 0.30 | 0.60 | 1.49 | 2.98 | 5.96 | 14.90 | 29.80 | 59.60 | 298.02 | 596.05 | 1,490.12 | 2,980.23 | 5,960.46 |
24 | 4 | 16,777,216 | 5 | 6 | 1,048,576 | 16 | 0.06 | 0.30 | 0.60 | 1.49 | 2.98 | 5.96 | 14.90 | 29.80 | 59.60 | 298.02 | 596.05 | 1,490.12 | 2,980.23 | 5,960.46 |
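The figures in the table follow from simple arithmetic, which the sketch below reproduces. The function name `layout_stats` and the dictionary keys are hypothetical. It assumes, as the table rows suggest, that the total number of directories is 2 to the power of "hash bits" and that "dir bits" is the number of bits used for the lower level.

```python
def layout_stats(hash_bits: int, dir_bits: int, total_files: int) -> dict:
    """Directory counts and average files per directory for one table row."""
    total_dirs = 2 ** hash_bits
    return {
        "total_directories": total_dirs,
        "top_level_directories": 2 ** (hash_bits - dir_bits),
        "lower_level_directories": 2 ** dir_bits,  # per top-level directory
        "avg_files_per_directory": total_files / total_dirs,
    }


row = layout_stats(hash_bits=8, dir_bits=4, total_files=1_000_000)
print(row["total_directories"], row["avg_files_per_directory"])  # prints 256 3906.25
```

A calculation like this can help pick a setting that keeps the average number of files per directory in a comfortable range for the expected volume of documents.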
The HDFS HAR Compactor and the HDFS Binary Writer have been tested against Cloudera v5.10.1.
Please let us know.