Problem Description
I'm having some issues with the way that Spark is interpreting columns for Parquet.
I have an Oracle source with a confirmed schema (df.schema() method):
root
|-- LM_PERSON_ID: decimal(15,0) (nullable = true)
|-- LM_BIRTHDATE: timestamp (nullable = true)
|-- LM_COMM_METHOD: string (nullable = true)
|-- LM_SOURCE_IND: string (nullable = true)
|-- DATASET_ID: decimal(38,0) (nullable = true)
|-- RECORD_ID: decimal(38,0) (nullable = true)
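For context, the source is read over JDBC roughly like the sketch below (Java; the connection details are placeholders, not the real ones):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("OracleToParquet")
        .getOrCreate();

// Placeholder JDBC options; the real URL, table and credentials differ.
Dataset<Row> df = spark.read()
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE")
        .option("dbtable", "LM_PERSON")
        .option("user", "user")
        .option("password", "password")
        .load();

df.printSchema();  // prints the root / |-- listing shown above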
This is then saved as Parquet (df.write().parquet() method) with the corresponding message type, as determined by Spark:
message spark_schema {
optional int64 LM_PERSON_ID (DECIMAL(15,0));
optional int96 LM_BIRTHDATE;
optional binary LM_COMM_METHOD (UTF8);
optional binary LM_SOURCE_IND (UTF8);
optional fixed_len_byte_array(16) DATASET_ID (DECIMAL(38,0));
optional fixed_len_byte_array(16) RECORD_ID (DECIMAL(38,0));
}
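The write itself is nothing special; roughly (the path is a placeholder matching the LOCATION in the DDL below):
// Sketch of the write step; Spark picks the Parquet physical types shown above.
df.write()
  .mode("overwrite")
  .parquet("<PATH>");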
My application then generates the table DDL using a HashMap for type conversion, for example:
CREATE EXTERNAL TABLE IF NOT EXISTS
ELM_PS_LM_PERSON (
LM_PERSON_ID DECIMAL(15,0)
,LM_BIRTHDATE TIMESTAMP
,LM_COMM_METHOD STRING
,LM_SOURCE_IND STRING
,DATASET_ID DECIMAL(38,0)
,RECORD_ID DECIMAL(38,0)
) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '<PATH>'
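The HashMap conversion is essentially a lookup like the simplified sketch below (names are illustrative, not my exact code); note that it only ever sees the Spark SQL type decimal(15,0), never the physical int64 that Spark chose for the Parquet file:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DecimalType;
import org.apache.spark.sql.types.StructField;

// Illustrative mapping from a Spark SQL field to a Hive/Impala DDL type.
static String toDdlType(StructField field) {
    DataType dt = field.dataType();
    if (dt instanceof DecimalType) {
        DecimalType dec = (DecimalType) dt;
        // LM_PERSON_ID becomes DECIMAL(15,0) here, even though the Parquet
        // file stores it physically as int64.
        return "DECIMAL(" + dec.precision() + "," + dec.scale() + ")";
    }
    Map<String, String> simple = new HashMap<>();
    simple.put("string", "STRING");
    simple.put("timestamp", "TIMESTAMP");
    simple.put("long", "BIGINT");
    return simple.getOrDefault(dt.typeName(), dt.sql());
}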
My issue is that the table will fail to be read by Impala because it will not accept LM_PERSON_ID as a decimal field. The table will only read the parquet file if this column is set to BIGINT.
Query 8d437faf6323f0bb:b7ba295d028c8fbe: 0% Complete (0 out of 1)
File 'hdfs:dev/ELM/ELM_PS_LM_PERSON/part-00000-fcdbd3a5-9c93-490e-a124-c2a327a17a17.snappy.parquet' has an incompatible Parquet schema for column 'rbdshid1.elm_ps_lm_person_2.lm_person_id'.
Column type: DOUBLE, Parquet schema:
optional int64 LM_PERSON_ID [i:0 d:1 r:0]
How do I know when to substitute a Decimal field for BIGINT?
The parquet message type is logged but not accessible?
Two decimal fields are converted to fixed_len_byte_array(16), while LM_PERSON_ID is converted to int64.
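On the "logged but not accessible" point: the message type lives in the Parquet file footer, so it can in principle be read back programmatically. A rough sketch (the file path is a placeholder):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

Configuration conf = new Configuration();
Path file = new Path("/path/to/part-00000.snappy.parquet");  // placeholder

try (ParquetFileReader reader =
         ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
    // The footer carries the same "message spark_schema { ... }" block shown above.
    MessageType schema = reader.getFooter().getFileMetaData().getSchema();
    System.out.println(schema);
}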
The only resolution I can think of is to create the table, test whether it can be read, and if not, drop it and substitute the decimal fields with BIGINT one by one, testing each time.
What am I missing here? Can I enforce a schema on the parquet file for decimals?
Recommended Answer
From the Configuration section of the official Apache Spark documentation on Parquet Files:
spark.sql.parquet.writeLegacyFormat (default: false)
If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.
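In practice this just means setting the property before the write; a minimal sketch:
// Must be set before the DataFrame is written, so that decimals are stored
// in the fixed-length byte array layout that Hive and Impala understand.
spark.conf().set("spark.sql.parquet.writeLegacyFormat", "true");

df.write()
  .mode("overwrite")
  .parquet("<PATH>");
The same property can also be passed as --conf spark.sql.parquet.writeLegacyFormat=true on spark-submit.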
Answer Given Before Official Docs Got Updated
The very similar SPARK-20297 Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala was recently (20/Apr/17 01:59) resolved as Not A Problem.
The main point is to use the spark.sql.parquet.writeLegacyFormat property and write the parquet metadata in the legacy format (which I did not see described in the official documentation under Configuration, and which I reported as an improvement in SPARK-20937).
With spark.sql.parquet.writeLegacyFormat enabled, Hive and Impala can read the data written by Spark.
It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal - and I missed the documentation. Wouldn't it then be bugs in Impala or Hive?
The int32/int64 options were present in the original version of the decimal spec, they just weren't widely implemented: https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289. So it's not a new/old-version thing; it was just an alternative representation that many systems didn't implement.
SPARK-10400 can also be quite helpful reading (about the history of the spark.sql.parquet.writeLegacyFormat property):
We introduced the SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing the Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use the legacy Parquet format adopted by Spark 1.4 and prior versions, or the standard format defined in the parquet-format spec. However, the name of this option is somewhat confusing, because it is not very intuitive why we should not follow the spec. It would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" (isPublic is false).