Parquet vs ORC vs ORC with Snappy

Problem description

I am running a few tests on the storage formats available with Hive, with Parquet and ORC as the main options. I included ORC twice: once with default compression and once with Snappy.

I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite.

Here are some details of my data:

Table A - Text File Format - 2.5 GB

Table B - ORC - 652 MB

Table C - ORC with Snappy - 802 MB

Table D - Parquet - 1.9 GB

Parquet was the worst in terms of compression for my table.
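To put the table sizes on a common scale, the sketch below computes how many times smaller each format is than the text table. The figures are the ones reported above; the conversion of 1 GB to 1024 MB and the names `sizes_mb` and `compression_ratio` are assumptions made for illustration.

```python
# Table sizes as reported in the question, in MB.
# Assumption: 1 GB is treated as 1024 MB.
sizes_mb = {
    "Text": 2.5 * 1024,
    "ORC": 652,
    "ORC with Snappy": 802,
    "Parquet": 1.9 * 1024,
}


def compression_ratio(fmt, baseline="Text"):
    """Return how many times smaller `fmt` is than the baseline table."""
    return sizes_mb[baseline] / sizes_mb[fmt]


for fmt in sizes_mb:
    print(f"{fmt}: {compression_ratio(fmt):.2f}x smaller than text")
```

Under these assumptions ORC comes out at roughly 3.9x smaller than text, while Parquet is only about 1.3x smaller, which is the gap the question is asking about.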

My tests with the above tables yielded the following results.

Row count operation

Text Format Cumulative CPU - 123.33 sec

Parquet Format Cumulative CPU - 204.92 sec

ORC Format Cumulative CPU - 119.99 sec

ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec

Parquet Format Cumulative CPU - 255.2 sec

ORC Format Cumulative CPU - 120.48 sec

ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec

Parquet Format Cumulative CPU - 211.73 sec

ORC Format Cumulative CPU - 165.5 sec

ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using a where clause

Text Format Cumulative CPU - 72.48 sec

Parquet Format Cumulative CPU - 136.4 sec

ORC Format Cumulative CPU - 96.63 sec

ORC with SNAPPY Cumulative CPU - 82.05 sec
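The figures above are easier to compare side by side. The following sketch collects them in one structure and reports, per query, which format used the least cumulative CPU and how Parquet compares with default ORC. The query labels and the helper names `fastest` and `parquet_vs_orc` are made up for this illustration; note also that cumulative CPU time is summed across tasks, so it is not the same thing as wall-clock response time.

```python
# Cumulative CPU seconds as reported in the question.
cpu_seconds = {
    "row count": {
        "Text": 123.33, "Parquet": 204.92,
        "ORC": 119.99, "ORC with Snappy": 107.05,
    },
    "sum": {
        "Text": 127.85, "Parquet": 255.2,
        "ORC": 120.48, "ORC with Snappy": 98.27,
    },
    "average": {
        "Text": 128.79, "Parquet": 211.73,
        "ORC": 165.5, "ORC with Snappy": 135.45,
    },
    "filtered select": {
        "Text": 72.48, "Parquet": 136.4,
        "ORC": 96.63, "ORC with Snappy": 82.05,
    },
}


def fastest(query):
    """Return the format with the lowest cumulative CPU for a query."""
    timings = cpu_seconds[query]
    return min(timings, key=timings.get)


def parquet_vs_orc(query):
    """Return Parquet CPU time as a multiple of default ORC CPU time."""
    timings = cpu_seconds[query]
    return timings["Parquet"] / timings["ORC"]


for query in cpu_seconds:
    print(f"{query}: fastest = {fastest(query)}, "
          f"Parquet/ORC = {parquet_vs_orc(query):.2f}")
```

In these measurements Parquet is the slowest format for every query, and plain text is actually the cheapest for the average and the filtered select, which suggests the comparison is worth repeating with the compression settings checked.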

Does that mean ORC is faster than Parquet? Or is there something I can do to make Parquet work better in terms of query response time and compression ratio?

Thanks!

Recommended answer

I would say that both of these formats have their own advantages.

Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does.
Apache ORC might be better if your file structure is flat.

Also, as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help query response time, especially when it comes to sum operations.
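To illustrate why a lightweight index can reduce work for a range filter, here is a minimal sketch of min/max statistics kept per block of rows. It is not ORC's actual reader: the `Stripe` class, the `scan_range` function, and the sample values are all hypothetical, and the only idea borrowed from ORC is that stored minimum and maximum values let a reader skip blocks that cannot match.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stripe:
    """A block of column values with precomputed min/max statistics."""
    values: List[int]

    def __post_init__(self):
        # Statistics are computed once, when the stripe is "written".
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_range(stripes, low, high):
    """Return values in [low, high] and the number of stripes actually read."""
    matches = []
    stripes_read = 0
    for stripe in stripes:
        # Skip the stripe entirely if its value range cannot overlap the filter.
        if stripe.max_value < low or stripe.min_value > high:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.values if low <= v <= high)
    return matches, stripes_read


stripes = [Stripe([1, 5, 9]), Stripe([20, 25, 30]), Stripe([41, 47, 50])]
matches, stripes_read = scan_range(stripes, 22, 45)
print(matches, stripes_read)
```

Here the first stripe is never read because its maximum is below the lower bound of the filter. How much this helps in practice depends on how well the data is sorted or clustered on the filtered column.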

The Parquet default compression is SNAPPY. Do tables A, B, C, and D hold the same dataset? If so, something looks off when Parquet only compresses it to 1.9 GB.
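One way to rule out a compression mismatch is to set the codec explicitly on each table rather than relying on defaults. The sketch below builds Hive `CREATE TABLE` statements that do this through the `orc.compress` and `parquet.compression` table properties. The helper `create_table_ddl` and the table and column names are hypothetical, and whether a given property is honoured depends on the Hive version in use, so treat this as a starting point to verify against your cluster.

```python
# Table property that selects the compression codec for each file format.
CODEC_PROPERTY = {
    "ORC": "orc.compress",
    "PARQUET": "parquet.compression",
}


def create_table_ddl(table, columns, file_format, codec):
    """Build a Hive CREATE TABLE statement with an explicit compression codec.

    `columns` is a list of (name, hive_type) pairs.
    """
    fmt = file_format.upper()
    prop = CODEC_PROPERTY[fmt]
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({cols}) STORED AS {fmt} "
        f'TBLPROPERTIES ("{prop}"="{codec.upper()}")'
    )


# Hypothetical table and column names, for illustration only.
columns = [("id", "BIGINT"), ("amount", "DOUBLE")]
parquet_ddl = create_table_ddl("table_d_snappy", columns, "parquet", "snappy")
orc_ddl = create_table_ddl("table_c_snappy", columns, "orc", "snappy")
print(parquet_ddl)
print(orc_ddl)
```

After loading the same data into both tables, comparing their on-disk sizes again would show whether the 1.9 GB Parquet table was really compressed.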
