Question
I have a fairly large Parquet file in which I need to change the values of one column. One way to do this would be to update those values in the source text files and recreate the Parquet file, but I'm wondering whether there is a less expensive and simpler solution.
Recommended answer
Let's start with the basics:
Parquet is a file format, and its files need to be saved in a file system.
Key questions:
- Does Parquet support append operations?
- Does the file system (namely, HDFS) allow append on files?
- Can the job framework (Spark) implement append operations?
Answers:
- parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (I'm not sure, but this could change in other implementations; the Parquet design itself does support append.)
- HDFS allows append on files via the dfs.support.append property.
- The Spark framework does not support append to existing Parquet files, and there are no plans to add it; see this JIRA.
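Since none of the three layers offers an in-place update, what remains is to read the file, change the column, and write a new Parquet file. Below is a minimal PySpark sketch of that rewrite, not a definitive implementation. The function name, its parameters, and any paths passed to it are hypothetical, and it assumes an existing SparkSession and a simple "replace one value with another" rule.

```python
def rewrite_with_updated_column(spark, src_path, dst_path, column, old_value, new_value):
    """Read a Parquet dataset, replace one value in one column, write a new dataset.

    All names here (function, parameters, paths) are hypothetical.
    """
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    df = spark.read.parquet(src_path)

    # Rows where `column` equals old_value get new_value; all other rows keep their value.
    updated = df.withColumn(
        column,
        F.when(F.col(column) == old_value, F.lit(new_value)).otherwise(F.col(column)),
    )

    # Write to a different location: Spark refuses to overwrite a path
    # that the same query is still reading from.
    updated.write.mode("overwrite").parquet(dst_path)
```

With a SparkSession named spark, a call might look like rewrite_with_updated_column(spark, "/data/in.parquet", "/data/out.parquet", "status", "old", "new"). Once the new output has been checked, it can take the place of the original. This still rewrites the whole file, but it avoids regenerating it from the source text files.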
More details here:
http://bytepadding.com/linux/understanding-basics-文件系统/