Problem description
I have a directory containing ORC files. I am creating a DataFrame using the below code
var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`");
It returns a DataFrame with this schema
[_col0: int, _col1: bigint]
Whereas the expected schema is
[scan_nbr: int, visit_nbr: bigint]
When I query files in Parquet format, I get the correct schema.
Am I missing any configuration(s)?
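One workaround, assuming the physical column order in the files matches the intended schema (the table name here is illustrative), is to declare an external Hive table over the directory and query it through the metastore, so Spark takes the column names from the metastore instead of the ORC file footer:

CREATE EXTERNAL TABLE sample (scan_nbr INT, visit_nbr BIGINT)
STORED AS ORC
LOCATION '/directory/containing/orc/files';

-- then from Spark:
-- var data = sqlContext.sql("SELECT * FROM sample");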
Adding more details
This is Hortonworks Distribution HDP 2.4.2 (Spark 1.6.1, Hadoop 2.7.1, Hive 1.2.1)
We haven't changed the default configurations of HDP, but this is definitely not the same as the plain vanilla version of Hadoop.
Data is written by upstream Hive jobs, a simple CTAS (CREATE TABLE sample STORED AS ORC as SELECT ...).
I tested this on files generated by CTAS with the latest Hive 2.0.0, and it preserves the column names in the ORC files.
The problem is the Hive version: 1.2.1 has bug HIVE-4243 (ORC files are written with placeholder column names _col0, _col1, ...), which was fixed in Hive 2.0.0.
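If upgrading Hive is not an option, a minimal Spark-side sketch is to rename the columns after loading. This assumes the positional order of _col0, _col1 matches the expected schema from the question:

var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`")
// Reassign column names positionally; order must match the original Hive DDL.
val renamed = data.toDF("scan_nbr", "visit_nbr")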