How do I drop all columns with null values in a PySpark DataFrame?
Question
I have a large dataset from which I would like to drop the columns that contain null values and return a new DataFrame. How can I do that?
The following only drops a single column, or rows containing null:
df.where(col("dt_mvmt").isNull())  # doesn't work: I don't have all the column names, and there may be thousands of columns
df.filter(df.dt_mvmt.isNotNull())  # same reason as above
df.na.drop()  # drops rows that contain null, instead of columns that contain null
For example:
a | b | c
1 | | 0
2 | 2 | 3
In the above case it should drop the whole column b, because one of its values is empty.
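For illustration, here is a minimal sketch of why df.na.drop() does not help, assuming a standard SparkSession handle named spark: it removes the row containing the null, while both columns survive.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The example table from the question; None marks the empty cell in column b
df = spark.createDataFrame([(1, None, 0), (2, 2, 3)], ["a", "b", "c"])

# Drops the row (1, None, 0); columns a, b and c all remain,
# whereas the goal is to drop column b instead
df.na.drop().show()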
Answer
Here is one possible way to drop all columns that contain NULL values; the code for counting NULL values per column is borrowed from another answer.
import pandas as pd
import pyspark.sql.functions as F

# Sample data (sqlContext is the pre-2.0 entry point; with a SparkSession, use spark.createDataFrame)
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})
df = sqlContext.createDataFrame(df)
df.show()
def drop_null_columns(df):
    """
    Drops all columns of a PySpark DataFrame that contain null values.
    :param df: A PySpark DataFrame
    :return: The DataFrame with the null-containing columns removed
    """
    # F.count only counts non-null values, so counting the rows where
    # F.when(F.col(c).isNull(), c) fires yields the number of nulls in
    # each column, all in a single pass over the data
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    df = df.drop(*to_drop)
    return df
# Drops column x2, because it contains null values
drop_null_columns(df).show()
+---+----+---+
| x1| x2| x3|
+---+----+---+
| a| b| c|
| 1|null| 0|
| 2| 2| 3|
+---+----+---+
+---+---+
| x1| x3|
+---+---+
| a| c|
| 1| 0|
| 2| 3|
+---+---+
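Note that isNull() does not match NaN, which Spark treats as a regular floating-point value. If your data can also contain NaN in float or double columns, F.isnan can be combined with the null check. Below is a minimal sketch under that assumption; the name drop_null_or_nan_columns is just illustrative:

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, FloatType

def drop_null_or_nan_columns(df):
    """Variant that also treats NaN in float/double columns as missing."""
    def is_missing(c):
        # F.isnan is only defined for float/double columns
        if isinstance(df.schema[c].dataType, (FloatType, DoubleType)):
            return F.col(c).isNull() | F.isnan(c)
        return F.col(c).isNull()
    counts = df.select([F.count(F.when(is_missing(c), c)).alias(c) for c in df.columns]).collect()[0].asDict()
    return df.drop(*[k for k, v in counts.items() if v > 0])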