This article describes how to compare two Spark dataframes. It should be a useful reference for anyone facing the same problem; read on to learn more.

Problem description

Spark dataframe 1:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |236 |431|169    |
|city 2|prod 1 |9/28/2017|358 |975|193    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
+------+-------+---------+----+---+-------+

Spark dataframe 2:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |230 |430|160    |
|city 1|prod 4 |9/27/2017|350 |90 |190    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
|city 3|prod 4 |9/18/2017|230 |431|169    |
+------+-------+---------+----+---+-------+

Please find the Spark dataframes that satisfy each of the following conditions, applied to Spark dataframe 1 and Spark dataframe 2 above:

  1. Deleted records
  2. New records
  3. Records with no changes
  4. Records with changes

Here the comparison keys are 'city', 'product', and 'date'.

We need a solution that does not use Spark SQL.

Answer

I am not sure about finding the deleted and modified records, but you can use the `except` function to get the difference:

df2.except(df1)

This returns the rows that have been added or modified in dataframe 2, i.e. the records with changes. Output:

+------+-------+---------+----+---+-------+
|  city|product|     date|sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 3| prod 4|9/18/2017| 230|431|    169|
|city 1| prod 4|9/27/2017| 350| 90|    190|
|city 1| prod 3|9/9/2017 | 230|430|    160|
+------+-------+---------+----+---+-------+

You can also try a join and filter to get the changed and unchanged data:

df1.join(df2, Seq("city","product", "date"), "left").show(false)
df1.join(df2, Seq("city","product", "date"), "right").show(false)
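Building on the `except` and join ideas above, all four requested categories can be derived by combining `left_anti`/`left_semi` joins on the key columns with `intersect` and `except`. The following is a minimal sketch, not the answerer's original code; it reconstructs the sample data with a local SparkSession for illustration:

```scala
import org.apache.spark.sql.SparkSession

object CompareDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compare-dataframes")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
      ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
      ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
      ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
      ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
    ).toDF("city", "product", "date", "sale", "exp", "wastage")

    val df2 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
      ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
      ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
      ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
      ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
      ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
    ).toDF("city", "product", "date", "sale", "exp", "wastage")

    val keys = Seq("city", "product", "date")

    // 1. Deleted records: keys present in df1 but absent from df2
    val deleted = df1.join(df2, keys, "left_anti")

    // 2. New records: keys present in df2 but absent from df1
    val added = df2.join(df1, keys, "left_anti")

    // 3. Records with no changes: full rows identical in both frames
    val unchanged = df1.intersect(df2)

    // 4. Records with changes: row differs between the frames,
    //    but its key exists in both
    val changed = df2.except(df1).join(df1, keys, "left_semi")

    deleted.show(false)
    added.show(false)
    unchanged.show(false)
    changed.show(false)

    spark.stop()
  }
}
```

A `left_anti` join keeps only the left-side rows with no key match on the right, while `left_semi` keeps the left-side rows that do have a match; that is why `left_semi` against df1 isolates the truly modified rows out of `df2.except(df1)`, filtering out the brand-new keys.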

Hope this helps!

That concludes this article on comparing two Spark dataframes. We hope the answer above is helpful.
