Question
I need a PySpark solution for Pandas drop_duplicates(keep=False). Unfortunately, the keep=False option is not available in PySpark...
Pandas example:
import pandas as pd

df_data = {'A': ['foo', 'foo', 'bar'],
           'B': [3, 3, 5],
           'C': ['one', 'two', 'three']}
df = pd.DataFrame(data=df_data)
df = df.drop_duplicates(subset=['A', 'B'], keep=False)
print(df)
Expected output:
     A  B      C
2  bar  5  three
A conversion .to_pandas() and back to PySpark is not an option.

Thanks!
Answer
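The snippets below assume df is a Spark DataFrame holding the same data as the Pandas example above. A minimal setup sketch (the SparkSession creation is an assumption for reproducibility, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Spark equivalent of the Pandas frame from the question
df = spark.createDataFrame(
    [('foo', 3, 'one'), ('foo', 3, 'two'), ('bar', 5, 'three')],
    ['A', 'B', 'C'])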
Use a window function to count the number of rows for each A / B combination, and then filter the result to keep only the rows that are unique:
import pyspark.sql.functions as f

df.selectExpr(
    '*',
    'count(*) over (partition by A, B) as cnt'
).filter(f.col('cnt') == 1).drop('cnt').show()
+---+---+-----+
| A| B| C|
+---+---+-----+
|bar| 5|three|
+---+---+-----+
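The same window count can also be expressed with the DataFrame API instead of a SQL expression; a minimal equivalent sketch:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# count rows per A / B group, then keep only groups of size 1
w = Window.partitionBy('A', 'B')
df.withColumn('cnt', f.count('*').over(w)) \
    .filter(f.col('cnt') == 1) \
    .drop('cnt') \
    .show()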
Or another option using pandas_udf:
from pyspark.sql.functions import pandas_udf, PandasUDFType

# keep_unique returns the group unchanged if it has only one row,
# otherwise it drops the group
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def keep_unique(df):
    return df.iloc[:0] if len(df) > 1 else df

df.groupBy('A', 'B').apply(keep_unique).show()
+---+---+-----+
| A| B| C|
+---+---+-----+
|bar| 5|three|
+---+---+-----+
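Note that on Spark 3.x the GROUPED_MAP pandas UDF style is deprecated in favor of applyInPandas. A minimal sketch of the same group-dropping logic in the newer API (the plain keep_unique function here is a rewrite for that API, not part of the original answer):

# Spark 3.x style: a plain function applied to each group's pandas DataFrame
def keep_unique(pdf):
    # keep the group only when it contains a single row
    return pdf if len(pdf) == 1 else pdf.iloc[:0]

df.groupBy('A', 'B').applyInPandas(keep_unique, schema=df.schema).show()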