Problem Description
I have a problem using spark 2.1.1 and hadoop 2.6 on Ambari. I tested my code on my local computer first (single node, local files) and everything works as expected:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.master('yarn')\
.appName('localTest')\
.getOrCreate()
data = spark.read.format('orc').load('mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()
+-------+------------------+
|summary| colname |
+-------+------------------+
| count| 1688264|
| mean|17.963293650793652|
| stddev|5.9136724822401425|
| min| 0.5|
| max| 87.5|
+-------+------------------+
These values are totally plausible.
Now I uploaded my data to a hadoop cluster (ambari setup, yarn, 11 nodes) and pushed it into HDFS using hadoop fs -put /home/username/mydata /mydata
Now I tested the same code, which produced the following table:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.master('yarn')\
.appName('localTest')\
.getOrCreate()
data = spark.read.format('orc').load('hdfs:///mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()
+-------+------------------+
|summary| colname |
+-------+------------------+
| count| 2246009|
| mean|1525.5387403802445|
| stddev|16250.611372902456|
| min| -413050.0|
| max| 1.6385821E7|
+-------+------------------+
But another thing completely confuses me -> if I change mydata/*.orc to mydata/any_single_file.orc and hdfs:///mydata/*.orc to hdfs:///mydata/any_single_file.orc, then both tables (cluster, local PC) are the same...
Does anyone know more about this weird behaviour?
Thanks a lot!
After a week of searching, the "solution" for me was that in some files the schema was a little bit different (one column more or less), and while schema merging is implemented for Parquet, ORC does not support schema merging for now: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11412
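A minimal sketch for checking which files deviate, assuming the hdfs:///mydata/ path from the question; it lists the directory through the Hadoop FileSystem API reached via Spark's internal JVM gateway (spark._jvm / spark._jsc are not public API) and prints each file's schema, so a file with an extra or missing column stands out:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('schemaCheck').getOrCreate()

# List hdfs:///mydata/ via the Hadoop FileSystem API exposed through Spark's
# internal JVM gateway (spark._jvm / spark._jsc are not a public interface).
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

for status in fs.listStatus(hadoop.fs.Path('hdfs:///mydata/')):
    path = status.getPath().toString()
    if path.endswith('.orc'):
        # A file with an extra or missing column is the kind of mismatch
        # the ORC reader cannot merge away.
        print(path, spark.read.format('orc').load(path).schema.simpleString())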
So my workaround was to load the ORC files one after another and use the df.write.parquet() method to convert them. After the conversion was finished, I could load them all together using *.parquet instead of *.orc in the file path.
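A rough sketch of that workaround, assuming placeholder file names and a scratch directory hdfs:///mydata_parquet/ (the file names and the scratch directory are placeholders, not paths from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('orcToParquet').getOrCreate()

# Placeholder paths - replace with the real ORC files in hdfs:///mydata/
orc_files = ['hdfs:///mydata/file1.orc', 'hdfs:///mydata/file2.orc']

# Convert each ORC file on its own, writing every file into its own
# sub-directory so the conversion is easy to re-run for a single file.
for i, path in enumerate(orc_files):
    df = spark.read.format('orc').load(path)
    df.write.mode('overwrite').parquet('hdfs:///mydata_parquet/part_{}'.format(i))

# Parquet supports schema merging: the per-file schemas are unioned and
# columns missing from a file simply come back as null.
data = spark.read.option('mergeSchema', 'true').parquet('hdfs:///mydata_parquet/part_*')
data.select('colname').na.drop().describe(['colname']).show()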