How can I insert into a Hive table with Parquet file format and SNAPPY compression?


Problem description


Hive 2.1

I have the following table definition:

    CREATE EXTERNAL TABLE table_snappy (
      a STRING,
      b INT)
    PARTITIONED BY (c STRING)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION '/'
    TBLPROPERTIES ('parquet.compress'='SNAPPY');

Now I would like to insert data into it:

    INSERT INTO table_snappy PARTITION (c='something') VALUES ('xyz', 1);

However, when I look at the data file, all I see is a plain Parquet file without any compression. How can I enable Snappy compression in this case?

Goal: to have the Hive table data in Parquet format and SNAPPY-compressed.

I have also tried setting multiple properties:

    SET parquet.compression=SNAPPY;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;
    SET mapreduce.output.fileoutputformat.compress=true;
    SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET PARQUET_COMPRESSION_CODEC=snappy;

as well as

    TBLPROPERTIES ('parquet.compression'='SNAPPY');

but nothing helps. I tried the same with GZIP compression and it does not seem to work either. I am starting to wonder whether this is possible at all. Any help is appreciated.

Solution

One of the best ways to check whether a file is compressed is to use parquet-tools.

    create external table testparquet (id int, name string)
      stored as parquet
      location '/user/cloudera/testparquet/'
      tblproperties('parquet.compression'='SNAPPY');

    insert into testparquet values(1,'Parquet');

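One visible difference between the question and this example is the property key: the question's DDL sets 'parquet.compress', while the example here sets 'parquet.compression'. As a minimal sketch, the question's table could be declared with the latter key as shown below. The file name create_table_snappy.hql is hypothetical, everything else is copied from the question, and whether the key alone explains the asker's result is not confirmed, since the asker reports having tried 'parquet.compression' as well.

```shell
#!/bin/sh
# Write the DDL to a script file (hypothetical name). On a machine with the
# Hive CLI installed it could then be run with:
#   hive -f create_table_snappy.hql
cat > create_table_snappy.hql <<'EOF'
-- Table definition copied from the question; only the TBLPROPERTIES key
-- differs ('parquet.compression' instead of 'parquet.compress').
CREATE EXTERNAL TABLE table_snappy (
  a STRING,
  b INT)
PARTITIONED BY (c STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
EOF
```

Files written by a later INSERT can then be checked with parquet-tools in the same way as the testparquet file.
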
When you list the testparquet directory, the data file may not have .snappy anywhere in its name:

    [cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/testparquet
    Found 1 items
    -rwxr-xr-x 1 anonymous supergroup 323 2018-03-02 01:07 /user/cloudera/testparquet/000000_0

Let's inspect it further:

    [cloudera@quickstart ~]$ hdfs dfs -get /user/cloudera/testparquet/*
    [cloudera@quickstart ~]$ parquet-tools meta 000000_0
    creator: parquet-mr version 1.5.0-cdh5.12.0 (build ${buildNumber})
    file schema: hive_schema
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
    id: OPTIONAL INT32 R:0 D:1
    name: OPTIONAL BINARY O:UTF8 R:0 D:1
    row group 1: RC:1 TS:99
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
    id: INT32 SNAPPY DO:0 FPO:4 SZ:45/43/0.96 VC:1 ENC:PLAIN,RLE,BIT_PACKED
    name: BINARY SNAPPY DO:0 FPO:49 SZ:58/56/0.97 VC:1 ENC:PLAIN,RLE,BIT_PACKED
    [cloudera@quickstart ~]$

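The file is Snappy-compressed: the codec field of both column chunks (id and name) reads SNAPPY, even though the file name carries no .snappy suffix.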
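
For a file with many columns, reading the codec off each line by eye gets tedious. The following is a rough sketch of a helper that pulls the column name and codec out of `parquet-tools meta` output. The function name list_codecs is made up for illustration, and the parsing assumes the column-chunk line layout shown above (name, type, codec, then a DO: field); other parquet-tools versions may print a different layout.

```shell
#!/bin/sh
# list_codecs (hypothetical helper): reads `parquet-tools meta` output on
# stdin and prints "<column> <codec>" for every column-chunk line.
# A column-chunk line is recognised by its 4th field starting with "DO:",
# which the schema, creator and row-group lines do not have.
list_codecs() {
    awk '$4 ~ /^DO:/ { sub(/:$/, "", $1); print $1, $3 }'
}

# Sample lines copied from the parquet-tools output above.
sample='id: OPTIONAL INT32 R:0 D:1
name: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:1 TS:99
id: INT32 SNAPPY DO:0 FPO:4 SZ:45/43/0.96 VC:1 ENC:PLAIN,RLE,BIT_PACKED
name: BINARY SNAPPY DO:0 FPO:49 SZ:58/56/0.97 VC:1 ENC:PLAIN,RLE,BIT_PACKED'

printf '%s\n' "$sample" | list_codecs
# prints:
# id SNAPPY
# name SNAPPY
```

Against a real file the helper would be used as `parquet-tools meta 000000_0 | list_codecs`.
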
