Can Hive external tables detect new Parquet files in HDFS?

Problem description

I am using Hive bundled with Spark. My Spark Streaming job writes 250 Parquet files to HDFS per batch, in the form of /hdfs/nodes/part-r-$partition_num-$job_hash.gz.parquet. This means that after 1 job I have 250 files in HDFS, and after 2, I have 500. My external Hive table, created using Parquet, points at /hdfs/nodes for its location, but it doesn't update to include the data in the new files after I rerun the program.

Do Hive external tables pick up new files added to the table's location, or only changes to the files that existed when the table was created?

Also see my related question about automatically updating tables using Hive.
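For reference, the external table described above would look roughly like the following sketch; the table name and columns are hypothetical, and only the /hdfs/nodes location comes from the question:

-- A minimal sketch of the kind of external table described above.
-- Table name and columns are assumptions; LOCATION is the directory
-- the Spark Streaming job writes its Parquet part files into.
CREATE EXTERNAL TABLE nodes (
  node_id STRING,
  payload STRING
)
STORED AS PARQUET
LOCATION '/hdfs/nodes';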

Solution

This is a bit of a hack, but I did eventually get Hive to detect new files using new partitions and MSCK REPAIR TABLE tablename, which detects the new partitions after they have been created.

This does not fix the original issue, as I have to create a new partition each time I have new files I want in Hive, but it has allowed me to move forward.
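A minimal sketch of this workaround, assuming a hypothetical partition column batch_id and that each Spark run writes its files into a new /hdfs/nodes/batch_id=N/ subdirectory rather than directly into /hdfs/nodes:

-- Recreate the external table as a partitioned table. The partition
-- column (batch_id) and its values are assumptions for illustration.
CREATE EXTERNAL TABLE nodes (
  node_id STRING,
  payload STRING
)
PARTITIONED BY (batch_id STRING)
STORED AS PARQUET
LOCATION '/hdfs/nodes';

-- After each run writes a new partition directory, e.g.
-- /hdfs/nodes/batch_id=2/part-r-...gz.parquet, register it with the metastore:
MSCK REPAIR TABLE nodes;

-- Manual alternative when the new partition value is known:
ALTER TABLE nodes ADD IF NOT EXISTS PARTITION (batch_id = '2');

MSCK REPAIR TABLE works by scanning the table's LOCATION for directories that follow the column=value naming convention and adding any partitions the metastore does not yet know about, which is why the new files have to land in partition subdirectories rather than directly under /hdfs/nodes.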
