Question
I need a PySpark solution for Pandas drop_duplicates(keep=False). Unfortunately, the keep=False option is not available in PySpark...
Pandas example:
import pandas as pd

df_data = {'A': ['foo', 'foo', 'bar'],
           'B': [3, 3, 5],
           'C': ['one', 'two', 'three']}
df = pd.DataFrame(data=df_data)
df = df.drop_duplicates(subset=['A', 'B'], keep=False)
print(df)
Expected output:
     A  B      C
2  bar  5  three
A conversion .to_pandas() and back to PySpark is not an option.

Thanks!
Answer
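The snippets below assume df is a Spark DataFrame holding the same data as the Pandas example above. A minimal setup sketch (the SparkSession creation is an assumption for reproducibility, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Spark equivalent of the Pandas frame from the question
df = spark.createDataFrame(
    [('foo', 3, 'one'), ('foo', 3, 'two'), ('bar', 5, 'three')],
    ['A', 'B', 'C'])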
Use a window function to count the number of rows for each A / B combination, and then filter the result to keep only the rows that are unique:
import pyspark.sql.functions as f

df.selectExpr(
    '*',
    'count(*) over (partition by A, B) as cnt'
).filter(f.col('cnt') == 1).drop('cnt').show()
+---+---+-----+
| A| B| C|
+---+---+-----+
|bar| 5|three|
+---+---+-----+
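The same window count can also be expressed with the DataFrame API instead of a SQL expression; a minimal equivalent sketch:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# count rows per A / B group, then keep only groups of size 1
w = Window.partitionBy('A', 'B')
df.withColumn('cnt', f.count('*').over(w)) \
    .filter(f.col('cnt') == 1) \
    .drop('cnt') \
    .show()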
Or another option using pandas_udf:
from pyspark.sql.functions import pandas_udf, PandasUDFType

# keep_unique returns the group unchanged if it has only one row,
# otherwise it drops the group
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def keep_unique(df):
    return df.iloc[:0] if len(df) > 1 else df

df.groupBy('A', 'B').apply(keep_unique).show()
+---+---+-----+
| A| B| C|
+---+---+-----+
|bar| 5|three|
+---+---+-----+
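Note that on Spark 3.x the GROUPED_MAP pandas UDF style is deprecated in favor of applyInPandas. A minimal sketch of the same group-dropping logic in the newer API (the plain keep_unique function here is a rewrite for that API, not part of the original answer):

# Spark 3.x style: a plain function applied to each group's pandas DataFrame
def keep_unique(pdf):
    # keep the group only when it contains a single row
    return pdf if len(pdf) == 1 else pdf.iloc[:0]

df.groupBy('A', 'B').applyInPandas(keep_unique, schema=df.schema).show()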