Problem Description
I have a data frame with some columns, and before doing analysis I'd like to understand how complete the data frame is, so I want to filter the data frame and count for each column the number of non-null values, possibly returning a dataframe back.
Basically, I am trying to achieve the same result as expressed in this question, but using Scala instead of Python...
Say you have:
import spark.implicits._  // assumes an active SparkSession named spark
// Option[Int] gives nullable columns; None becomes null
val df = Seq((Some(0), Some(4), Some(3)), (None, Some(3), Some(4)), (None, None, Some(5)))
  .toDF("x", "y", "z")
How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?
Recommended Answer
Although I like Psidom's answer, often I'm more interested in the fraction of null values, because the number of non-null values alone doesn't tell you much...
You can do something like this:
import org.apache.spark.sql.functions.{count, sum, when}

df.agg(
  (sum(when($"x".isNotNull, 0).otherwise(1)) / count("*")).as("x : fraction null"),
  (sum(when($"y".isNotNull, 0).otherwise(1)) / count("*")).as("y : fraction null"),
  (sum(when($"z".isNotNull, 0).otherwise(1)) / count("*")).as("z : fraction null")
).show()
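If the dataframe has many columns, the same aggregation can be built programmatically instead of spelling out each column by hand. A minimal sketch, assuming the same df as above (the exprs name is just illustrative):

import org.apache.spark.sql.functions.col
// Build one "fraction null" expression per column from df.columns
val exprs = df.columns.map(c => (sum(when(col(c).isNotNull, 0).otherwise(1)) / count("*")).as(s"$c : fraction null"))
df.agg(exprs.head, exprs.tail: _*).show()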
sum(when($"x".isNotNull,0).otherwise(1))
也可以仅由count($"x")
代替,而count($"x")
仅计算非空值.由于发现不明显,我倾向于使用更清晰的sum
表示法
sum(when($"x".isNotNull,0).otherwise(1))
can also just be replaced by count($"x")
which only counts non-null values. As I find this not obvious, I tend to use the sum
notation which is more clear
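For the original question (the non-null count per column, returned as a single row), count on its own is enough, since it skips nulls. A minimal sketch, assuming the same df as above:

// count skips nulls, so this yields one row of non-null counts per column
df.agg(count($"x").as("x"), count($"y").as("y"), count($"z").as("z")).show()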