This question already has answers here:

Remove all rows that are duplicates with respect to some rows (2 answers)

Keep only duplicates from a DataFrame regarding some field (3 answers)

Closed last month.
                    
I'm using PySpark for this:

When applying dropDuplicates, I want to remove all occurrences of the matching rows, not keep one of them.

Dataset:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   A|
|   1|   1|   A|
|   2|   1|   C|
|   1|   2|   D|
|   3|   5|   E|
|   3|   5|   E|
|   4|   3|   G|
+----+----+----+


What I need:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   1|   C|
|   1|   2|   D|
|   4|   3|   G|
+----+----+----+


I tried using distinct, but distinct applies to all columns.

diff_df = source_df.union(target_df).dropDuplicates(columns_list)
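
For context, a minimal sketch of the behavior being worked around, assuming a local SparkSession and the sample data above: dropDuplicates on a subset of columns still keeps one row from each duplicate group instead of removing the group entirely.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, 'A'), (1, 1, 'A'), (2, 1, 'C'), (1, 2, 'D'),
     (3, 5, 'E'), (3, 5, 'E'), (4, 3, 'G')],
    ['col1', 'col2', 'col3'])

# dropDuplicates keeps the first row of each group, so one
# occurrence of (1, 1, A) and (3, 5, E) still survives here.
df.dropDuplicates(['col1', 'col2']).show()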

Best Answer

This is not an elegant approach, but it gives the idea:

>>> from pyspark.sql.functions import count
>>> df = spark.createDataFrame([(1,25),(1,20),(1,20),(2,26)],['id','age'])

>>> df.show()
+---+---+
| id|age|
+---+---+
|  1| 25|
|  1| 20|
|  1| 20|
|  2| 26|
+---+---+

>>> df.groupBy([c for c in df.columns]).agg(count('id').alias('c')).show()
+---+---+---+
| id|age|  c|
+---+---+---+
|  1| 25|  1|
|  1| 20|  2|
|  2| 26|  1|
+---+---+---+

>>> df.groupBy([c for c in df.columns]).agg(count('id').alias('c')).filter('c=1').show()
+---+---+---+
| id|age|  c|
+---+---+---+
|  1| 25|  1|
|  2| 26|  1|
+---+---+---+
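
Applying the same idea back to the question's dataset is a short step. A sketch, assuming df is the question's col1/col2/col3 frame from above and that col1 and col2 are the columns_list that defines a duplicate: group on those columns, keep only the groups that occur exactly once, drop the helper count, then join back to the original frame to recover the remaining columns.

from pyspark.sql.functions import count

# Groups of size 1 are the rows with no duplicate partner.
keep = (df.groupBy('col1', 'col2')
          .agg(count('*').alias('c'))
          .filter('c = 1')
          .drop('c'))

# Inner join on the key columns restores col3.
result = keep.join(df, ['col1', 'col2'])
result.show()

On the sample data this yields exactly the desired rows (2, 1, C), (1, 2, D) and (4, 3, G).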

Regarding python - PySpark: how to drop both occurrences with dropDuplicates on only a few columns, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58857277/
