This article describes how to check whether two DataFrames are equal in Spark; it may be a useful reference for anyone facing a similar problem.

Problem Description

I have the following situation: I have 2 DataFrames, each containing only 1 column. Let's say

DF1=(1,2,3,4,5)
DF2=(3,6,7,8,9,10)

Basically those values are keys, and I am creating a Parquet file of DF1 if the keys in DF1 are not in DF2 (in the current example it should return false). The way I currently meet this requirement is:

val df1count = DF1.count
val df2count = DF2.count
val diffDF = DF2.except(DF1)   // distinct rows of DF2 that do not appear in DF1
val diffCount = diffDF.count
// equality holds only when every key of DF1 also appears in DF2 (assuming the keys are distinct)
if (diffCount == (df2count - df1count)) true
else false

The problem with this approach is that I am calling actions 4 times, which is certainly not the best way. Can someone suggest the most efficient way of achieving this?

Recommended Answer

Here is a way to get the uncommon rows between two DataFrames:

import spark.implicits._ // spark is the SparkSession

val d1 = Seq((3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"), (1, "Hyderabad", "ram", "9848022338", 50000, "SF"), (2, "Hyderabad", "robin", "9848022339", 40000, "LA"), (4, "sanjose", "romin", "9848022331", 45123, "SanRamon"))
val d2 = Seq((3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"), (1, "Hyderabad", "ram", "9848022338", 50000, "SF"), (2, "Hyderabad", "robin", "9848022339", 40000, "LA"), (4, "sanjose", "romin", "9848022331", 45123, "SanRamon"), (4, "sanjose", "romino", "9848022331", 45123, "SanRamon"), (5, "LA", "Test", "1234567890", 12345, "Testuser"))

val df1 = d1.toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")
val df2 = d2.toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

// register the DataFrames as temporary views so they can be referenced from Spark SQL
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

// (df1 UNION df2) MINUS (df1 INTERSECT df2) keeps only the rows that appear in exactly one of the two DataFrames
spark.sql("((select * from df1) union (select * from df2)) minus ((select * from df1) intersect (select * from df2))").show
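For the original single-column case, a minimal sketch of the same containment check that the question's code performs, driven by a single action, could look like the following. This assumes Spark 2.x and that the key column is named "key" in both DataFrames; the column name and the output path are hypothetical, since neither is given in the question.

// Sketch: true when every key of DF1 also appears in DF2 (mirrors the semantics of the posted code)
val allKeysCovered = DF1.except(DF2).isEmpty   // Dataset.isEmpty needs Spark 2.4+; use .head(1).isEmpty on older versions

// Equivalent formulation with a left anti join: keeps only the rows of DF1 that have no match in DF2
// ("key" is an assumed column name; adjust to the real one)
val missing = DF1.join(DF2, Seq("key"), "left_anti")
val allKeysCovered2 = missing.head(1).isEmpty

if (allKeysCovered) {
  DF1.write.parquet("/path/to/output")   // hypothetical output path
}

Either form triggers only one Spark action for the check, instead of the several separate counts in the original approach.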

This concludes the article on comparing two DataFrames for equality; we hope the recommended answer is helpful.
