This article covers Spark 1.6: filtering DataFrames generated by describe(), and the workaround should be a useful reference for anyone hitting the same problem.
Problem description
The problem arises when I call the describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling the describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can display its contents by calling statsDF.show():
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 53173|
| mean|104.76128862392568|
| stddev|3577.8184333911513|
| min| 1|
| max| 558407|
+-------+------------------+
I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I get a Task not serializable exception.
I also face the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by the describe() function?
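One workaround worth noting before the accepted answer: the describe() result is tiny (five rows), so instead of filtering it as a DataFrame, you can collect the whole thing to the driver and filter locally, which sidesteps the serialization issue entirely. This is a minimal sketch assuming the statsDF from above; note that describe() returns all values as strings, so they need to be parsed:

```scala
// Sketch: collect the small describe() output to the driver and
// filter there instead of using where() on the DataFrame.
val stats = statsDF.collect()                     // Array[Row], only 5 rows
val stddevRow = stats.find(_.getString(0) == "stddev")
// describe() reports every statistic as a string, so parse it
val stddevValue = stddevRow.map(_.getString(1).toDouble)
```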
Recommended answer
I considered a toy dataset I had, containing some health disease data:
import org.apache.spark.sql.Row

val stddev_tobacco = rawData.describe().rdd.map {
  // keep the summary label and the value of the second column
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect()
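If only the mean and standard deviation are needed, another option is to skip describe() altogether and compute them directly with aggregate functions, which avoids filtering the summary DataFrame at all. This is a sketch assuming Spark 1.6 or later (stddev was added to org.apache.spark.sql.functions in 1.6); "myColumn" is a placeholder for the numeric column of interest:

```scala
import org.apache.spark.sql.functions.{mean, stddev}

// Sketch: compute the two statistics directly as doubles,
// rather than extracting them from describe()'s string output.
val row = rawData.agg(mean("myColumn"), stddev("myColumn")).head()
val (avgValue, stddevValue) = (row.getDouble(0), row.getDouble(1))
```

This returns proper Double values, so no string parsing is required.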