Problem Description
Spark DataFrame 1:
+------+-------+---------+----+---+-------+
|city |product|date |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193 |
|city 1|prod 2 |8/25/2017|50 |687|201 |
|city 1|prod 3 |9/9/2017 |236 |431|169 |
|city 2|prod 1 |9/28/2017|358 |975|193 |
|city 2|prod 2 |8/24/2017|50 |687|201 |
|city 3|prod 3 |9/8/2017 |236 |431|169 |
+------+-------+---------+----+---+-------+
Spark DataFrame 2:
+------+-------+---------+----+---+-------+
|city |product|date |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193 |
|city 1|prod 2 |8/25/2017|50 |687|201 |
|city 1|prod 3 |9/9/2017 |230 |430|160 |
|city 1|prod 4 |9/27/2017|350 |90 |190 |
|city 2|prod 2 |8/24/2017|50 |687|201 |
|city 3|prod 3 |9/8/2017 |236 |431|169 |
|city 3|prod 4 |9/18/2017|230 |431|169 |
+------+-------+---------+----+---+-------+
Please find the Spark DataFrames that satisfy each of the following conditions, applied to DataFrame 1 and DataFrame 2 above:
- Deleted records
- New records
- Records with no changes
- Records with changes
The comparison key here is 'city', 'product', 'date'. We need a solution that does not use Spark SQL.
Answer
I am not sure about finding the deleted and modified records, but you can use the except function to get the difference:
df2.except(df1)
This returns the rows that have been added or modified in DataFrame 2, i.e. the records with changes. Output:
+------+-------+---------+----+---+-------+
| city|product| date|sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 3| prod 4|9/18/2017| 230|431| 169|
|city 1| prod 4|9/27/2017| 350| 90| 190|
|city 1| prod 3|9/9/2017 | 230|430| 160|
+------+-------+---------+----+---+-------+
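To also get the deleted and new records, one option is an anti-join on the key columns, since except compares whole rows. A minimal sketch, assuming Spark 2.x and that ('city', 'product', 'date') uniquely identifies a row (df1 and df2 are the two dataframes above):

val keys = Seq("city", "product", "date")

// Deleted records: key combinations present in df1 but absent from df2
val deleted = df1.join(df2, keys, "left_anti")

// New records: key combinations present in df2 but absent from df1
val added = df2.join(df1, keys, "left_anti")

// Records with no changes: rows identical in both dataframes
val unchanged = df1.intersect(df2)

// Records with changes: rows that differ in df2 but whose key
// already existed in df1 (this filters out the brand-new rows)
val changed = df2.except(df1).join(df1, keys, "left_semi")

On the sample data, deleted would contain the (city 2, prod 1) row, added the two prod 4 rows, and changed the modified (city 1, prod 3) row.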
You can also try a join and filter to separate the changed and unchanged data:
df1.join(df2, Seq("city","product", "date"), "left").show(false)
df1.join(df2, Seq("city","product", "date"), "right").show(false)
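Note that joining on Seq(...) keeps a single copy of each key column, but the remaining columns (sale, exp, wastage) collide by name, so they cannot be compared directly. A sketch of one workaround, renaming df2's measure columns before the join (the _2 suffix is illustrative, not part of any API):

import org.apache.spark.sql.functions.col

val keys = Seq("city", "product", "date")

// Rename df2's non-key columns (e.g. sale -> sale_2) so both versions
// survive the join and can be compared side by side
val df2Renamed = df2.columns.foldLeft(df2) { (df, c) =>
  if (keys.contains(c)) df else df.withColumnRenamed(c, c + "_2")
}

val joined = df1.join(df2Renamed, keys, "inner")

// Records with no changes: every measure equals its counterpart
joined.filter(
  col("sale") === col("sale_2") &&
  col("exp") === col("exp_2") &&
  col("wastage") === col("wastage_2")
).show(false)

Negating the filter (e.g. with =!= on any column) gives the records whose values changed for a key present in both dataframes.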
Hope this helps!