Problem Description
I have a problem using spark 2.1.1 and hadoop 2.6 on Ambari. I tested my code on my local computer first (single node, local files) and everything works as expected:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.master('yarn')\
.appName('localTest')\
.getOrCreate()
data = spark.read.format('orc').load('mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()
+-------+------------------+
|summary| colname |
+-------+------------------+
| count| 1688264|
| mean|17.963293650793652|
| stddev|5.9136724822401425|
| min| 0.5|
| max| 87.5|
+-------+------------------+
These values are totally plausible.
Now I uploaded my data to a hadoop cluster (ambari setup, yarn, 11 nodes) and pushed it into HDFS using hadoop fs -put /home/username/mydata /mydata
Now I tested the same code, which produced the following table:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.master('yarn')\
.appName('localTest')\
.getOrCreate()
data = spark.read.format('orc').load('hdfs:///mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()
+-------+------------------+
|summary| colname |
+-------+------------------+
| count| 2246009|
| mean|1525.5387403802445|
| stddev|16250.611372902456|
| min| -413050.0|
| max| 1.6385821E7|
+-------+------------------+
But another thing completely confuses me -> if I change mydata/*.orc to mydata/any_single_file.orc and hdfs:///mydata/*.orc to hdfs:///mydata/any_single_file.orc, then both tables (cluster, local PC) are the same...
Does anyone know more about this weird behaviour?
Thanks a lot!
After a week of searching, the "solution" for me was that in some files the schema was a little bit different (one column more or less), and while schema merging is implemented for Parquet, ORC does not support schema merging for now: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11412
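A minimal sketch for checking which files deviate, assuming the hdfs:///mydata/ path from the question; it lists the directory through the Hadoop FileSystem API reached via Spark's internal JVM gateway (spark._jvm / spark._jsc are not public API) and prints each file's schema, so a file with an extra or missing column stands out:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('schemaCheck').getOrCreate()

# List hdfs:///mydata/ via the Hadoop FileSystem API exposed through Spark's
# internal JVM gateway (spark._jvm / spark._jsc are not a public interface).
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

for status in fs.listStatus(hadoop.fs.Path('hdfs:///mydata/')):
    path = status.getPath().toString()
    if path.endswith('.orc'):
        # A file with an extra or missing column is the kind of mismatch
        # the ORC reader cannot merge away.
        print(path, spark.read.format('orc').load(path).schema.simpleString())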
So my workaround was to load the ORC files one after another and use the df.write.parquet() method to convert them. After the conversion was finished, I could load them all together using *.parquet instead of *.orc in the file path.
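A rough sketch of that workaround, assuming placeholder file names and a scratch directory hdfs:///mydata_parquet/ (the file names and the scratch directory are placeholders, not paths from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('orcToParquet').getOrCreate()

# Placeholder paths - replace with the real ORC files in hdfs:///mydata/
orc_files = ['hdfs:///mydata/file1.orc', 'hdfs:///mydata/file2.orc']

# Convert each ORC file on its own, writing every file into its own
# sub-directory so the conversion is easy to re-run for a single file.
for i, path in enumerate(orc_files):
    df = spark.read.format('orc').load(path)
    df.write.mode('overwrite').parquet('hdfs:///mydata_parquet/part_{}'.format(i))

# Parquet supports schema merging: the per-file schemas are unioned and
# columns missing from a file simply come back as null.
data = spark.read.option('mergeSchema', 'true').parquet('hdfs:///mydata_parquet/part_*')
data.select('colname').na.drop().describe(['colname']).show()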