Problem description
Hive 2.1
I have the following table definition:
CREATE EXTERNAL TABLE table_snappy (
a STRING,
b INT)
PARTITIONED BY (c STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Now, I would like to insert data into it:
INSERT INTO table_snappy PARTITION (c='something') VALUES ('xyz', 1);
However, when I look at the data file, all I see is a plain Parquet file without any compression. How can I enable Snappy compression in this case?
Goal: to have Hive table data in Parquet format with SNAPPY compression.
I have tried setting multiple properties as well:
SET parquet.compression=SNAPPY;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET PARQUET_COMPRESSION_CODEC=snappy;
as well as
TBLPROPERTIES ('parquet.compression'='SNAPPY');
but nothing helps. I tried the same with GZIP compression and it does not seem to work either. I am starting to wonder whether it is possible at all. Any help is appreciated.
One of the best ways to check whether the data is compressed is to use parquet-tools. For example, create a table with the parquet.compression table property and insert a row:
create external table testparquet (id int, name string)
stored as parquet
location '/user/cloudera/testparquet/'
tblproperties('parquet.compression'='SNAPPY');
insert into testparquet values(1,'Parquet');
Now when you look at the file, its name may not contain .snappy anywhere:
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/testparquet
Found 1 items
-rwxr-xr-x 1 anonymous supergroup 323 2018-03-02 01:07 /user/cloudera/testparquet/000000_0
Let's inspect it further by fetching the file and printing its metadata:
[cloudera@quickstart ~]$ hdfs dfs -get /user/cloudera/testparquet/*
[cloudera@quickstart ~]$ parquet-tools meta 000000_0
creator: parquet-mr version 1.5.0-cdh5.12.0 (build ${buildNumber})
file schema: hive_schema
-------------------------------------------------------------------------------------------------------------------------------------------------------------
id: OPTIONAL INT32 R:0 D:1
name: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:1 TS:99
-------------------------------------------------------------------------------------------------------------------------------------------------------------
id: INT32 SNAPPY DO:0 FPO:4 SZ:45/43/0.96 VC:1 ENC:PLAIN,RLE,BIT_PACKED
name: BINARY SNAPPY DO:0 FPO:49 SZ:58/56/0.97 VC:1 ENC:PLAIN,RLE,BIT_PACKED
[cloudera@quickstart ~]$
Both column chunks are marked SNAPPY, so the data is Snappy compressed even though the file name does not say so.
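If you want to run this check over many files, the column-chunk lines of the parquet-tools meta output can be scanned with a small script. The following is only a sketch in Python: the function name column_codecs and the regular expression are my own assumptions, written against the output format shown above, and other parquet-tools versions may print the lines differently.

```python
import re

# Matches column-chunk lines as printed in the output above, e.g.
# "id: INT32 SNAPPY DO:0 FPO:4 SZ:45/43/0.96 VC:1 ENC:PLAIN,RLE,BIT_PACKED"
# Assumption: the codec is the second token after the column name and is
# followed by a "DO:" field. Schema lines ("id: OPTIONAL INT32 R:0 D:1")
# do not have that shape and are skipped.
COLUMN_LINE = re.compile(r"^(?P<name>\S+):\s+(?P<type>\S+)\s+(?P<codec>\S+)\s+DO:\d+")


def column_codecs(meta_output):
    """Return {column name: codec} for each column-chunk line in the text."""
    codecs = {}
    for line in meta_output.splitlines():
        match = COLUMN_LINE.match(line.strip())
        if match:
            codecs[match.group("name")] = match.group("codec")
    return codecs


# Sample text copied from the parquet-tools output shown above.
SAMPLE = """\
file schema: hive_schema
id: OPTIONAL INT32 R:0 D:1
name: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:1 TS:99
id: INT32 SNAPPY DO:0 FPO:4 SZ:45/43/0.96 VC:1 ENC:PLAIN,RLE,BIT_PACKED
name: BINARY SNAPPY DO:0 FPO:49 SZ:58/56/0.97 VC:1 ENC:PLAIN,RLE,BIT_PACKED
"""

print(column_codecs(SAMPLE))  # {'id': 'SNAPPY', 'name': 'SNAPPY'}
```

In practice you would pass in the text captured from parquet-tools meta for each downloaded file and flag any column whose codec is UNCOMPRESSED.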