I am currently writing Parquet files with MapReduce. I configured the row group size to 256 MB, and the HDFS block size is also set to 256 MB. Each output file is about 1 GB.

So I would expect each generated file to contain 4 row groups (1 GB / 256 MB). But when I run:
parquet-tools meta path/to/my/file | grep "row group"
it reports 63 row groups, with varying sizes and row counts:

row group 1:                      RC:69816 TS:244168913
row group 2:                      RC:35111 TS:117407826
row group 3:                      RC:18488 TS:60107388
row group 4:                      RC:10357 TS:33260415
row group 5:                      RC:7905 TS:24956045
row group 6:                      RC:4754 TS:15149122
row group 7:                      RC:3862 TS:12476651
row group 8:                      RC:2738 TS:9001631
row group 9:                      RC:2104 TS:7120040
row group 10:                     RC:1910 TS:6398391
row group 11:                     RC:1508 TS:5219072
row group 12:                     RC:1386 TS:4676154
row group 13:                     RC:1124 TS:3950635
row group 14:                     RC:999 TS:3518545
row group 15:                     RC:865 TS:3121657
row group 16:                     RC:774 TS:2801614
row group 17:                     RC:678 TS:2490904
row group 18:                     RC:511 TS:1996167
row group 19:                     RC:69808 TS:243894989
row group 20:                     RC:30176 TS:99585195
row group 21:                     RC:20678 TS:67779524
row group 22:                     RC:10743 TS:34547874
row group 23:                     RC:8258 TS:26080110
row group 24:                     RC:5227 TS:16456577
row group 25:                     RC:4136 TS:13321721
row group 26:                     RC:3207 TS:10272043
row group 27:                     RC:2437 TS:8107932
row group 28:                     RC:1945 TS:6563867
row group 29:                     RC:1561 TS:5320028
row group 30:                     RC:1389 TS:4809485
row group 31:                     RC:1206 TS:4251584
row group 32:                     RC:996 TS:3581746
row group 33:                     RC:895 TS:3203224
row group 34:                     RC:757 TS:2869939
row group 35:                     RC:653 TS:2550716
row group 36:                     RC:531 TS:2008746
row group 37:                     RC:69706 TS:244420245
row group 38:                     RC:32703 TS:109391929
row group 39:                     RC:18640 TS:60918458
row group 40:                     RC:10737 TS:34272225
row group 41:                     RC:7812 TS:24814707
row group 42:                     RC:5176 TS:16206655
row group 43:                     RC:4123 TS:13224377
row group 44:                     RC:3391 TS:10946649
row group 45:                     RC:2138 TS:7248145
row group 46:                     RC:1960 TS:6566944
row group 47:                     RC:1538 TS:5294523
row group 48:                     RC:1355 TS:4744634
row group 49:                     RC:1225 TS:4194625
row group 50:                     RC:1026 TS:3587484
row group 51:                     RC:877 TS:3134267
row group 52:                     RC:785 TS:2846718
row group 53:                     RC:675 TS:2546836
row group 54:                     RC:538 TS:2016450
row group 55:                     RC:69762 TS:244915809
row group 56:                     RC:32390 TS:108310300
row group 57:                     RC:18095 TS:58754777
row group 58:                     RC:10759 TS:34405301
row group 59:                     RC:8195 TS:26029310
row group 60:                     RC:5286 TS:16597963
row group 61:                     RC:4231 TS:13415076
row group 62:                     RC:3538 TS:11465640
row group 63:                     RC:135 TS:688850

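The row groups follow a repeating pattern (the sizes restart near 244 MB at row groups 19, 37, and 55). Does anyone know why Parquet does not honor the row group size I configured (256 MB)?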
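One way to make the repetition visible is to split the listing wherever TS jumps back up. Below is a minimal Python sketch, assuming the `row group N: RC:… TS:…` line format shown above; `parse_row_groups` and `split_cycles` are hypothetical helper names, and `SAMPLE` is just four lines copied from the listing.

```python
import re

# Matches lines such as "row group 19:    RC:69808 TS:243894989".
LINE_RE = re.compile(r"row group (\d+):\s+RC:(\d+)\s+TS:(\d+)")

# Four lines copied from the listing above (row groups 17 to 20).
SAMPLE = """\
row group 17:                     RC:678 TS:2490904
row group 18:                     RC:511 TS:1996167
row group 19:                     RC:69808 TS:243894989
row group 20:                     RC:30176 TS:99585195
"""


def parse_row_groups(text):
    """Return a list of (index, row_count, total_size) tuples."""
    return [(int(n), int(rc), int(ts)) for n, rc, ts in LINE_RE.findall(text)]


def split_cycles(groups):
    """Start a new run whenever TS is larger than in the previous row group."""
    cycles = []
    for group in groups:
        if not cycles or group[2] > cycles[-1][-1][2]:
            cycles.append([group])
        else:
            cycles[-1].append(group)
    return cycles


if __name__ == "__main__":
    for cycle in split_cycles(parse_row_groups(SAMPLE)):
        print([index for index, _, _ in cycle])
```

Applied to the full 63-line listing, this splitting yields runs of 18, 18, 18, and 9 row groups, which would be consistent with one run per 256 MB HDFS block in a file of about 1 GB.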

Best answer

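This is an open issue when writing Parquet files with Parquet-MR. The algorithm does not take compression into account, so it creates more row groups than expected.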

You can find more information about it here:
https://issues.apache.org/jira/browse/PARQUET-1337
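The undershoot can be illustrated with a toy model. This is not Parquet-MR's code, only a sketch under stated assumptions: the writer aims each row group at the space left in the current HDFS block (measured on buffered, uncompressed data), the data shrinks by a fixed `compression_ratio` when written, and the writer pads to the next block once the leftover space is within `max_padding`. The function name and the 0.5 ratio and 8 MiB padding values are illustrative, not taken from the question.

```python
MIB = 1024 * 1024


def simulate_block(block_size, row_group_size, compression_ratio, max_padding):
    """Return the targeted (uncompressed) size of each row group in one block.

    Toy model only: every row group is sized against the remaining space in
    the block, but occupies just compression_ratio of that size on disk.
    """
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression_ratio must be in (0, 1]")
    sizes = []
    remaining = block_size
    while remaining > max_padding:
        # Aim for the configured size, capped by what is left in the block.
        target = min(row_group_size, remaining)
        sizes.append(target)
        # On disk the row group is smaller than the target (assumed ratio).
        remaining -= max(1, int(target * compression_ratio))
    return sizes


if __name__ == "__main__":
    sizes = simulate_block(256 * MIB, 256 * MIB, 0.5, 8 * MIB)
    print([size // MIB for size in sizes])  # prints [256, 128, 64, 32, 16]
```

With a ratio of 1.0 (no compression) the model writes a single row group per block. With any real compression, each row group leaves room for another, smaller one, which resembles the shrinking sizes in the question. The exact counts depend on details this sketch ignores.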

This question about hadoop (more Parquet row groups in a file than expected) comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/46984387/
