Problem description
DataFrame 1 is what I have now, and I want to write a Scala function that makes DataFrame 1 look like DataFrame 2.
Transfer is the main category; e-Transfer and IMT are its subcategories.
The logic is applied per ID: if both Transfer and e-Transfer are tagged to the same ID (31898), the result should be e-Transfer only; if Transfer, IMT, and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; if only e-Transfer or only IMT is tagged to an ID (34193), it should stay e-Transfer or IMT.
I'm new to Scala and don't know how to write a good function to do this. Please help!
DataFrame 1                          DataFrame 2
+---------+-------------+            +---------+------------------+
| ID      | Category    |            | ID      | Category         |
+---------+-------------+            +---------+------------------+
| 31898   | Transfer    |            | 31898   | e-Transfer       |
| 31898   | e-Transfer  |            | 32614   | e-Transfer + IMT |
| 32614   | Transfer    |   =====>   | 33987   | Other            |
| 32614   | e-Transfer  |   =====>   | 34193   | e-Transfer       |
| 32614   | IMT         |            +---------+------------------+
| 33987   | Transfer    |
| 34193   | e-Transfer  |
+---------+-------------+
Recommended answer
You can group the DataFrame by ID and aggregate Category with collect_set to assemble an array of categories per ID, then create a new column from the contents of that array using array_contains:
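import org.apache.spark.sql.functions._
// Needed for toDF and the $"..." syntax; assumes a SparkSession named `spark`
// is in scope (spark-shell provides one and imports this automatically).
import spark.implicits._

val df = Seq(
  (31898, "Transfer"),
  (31898, "e-Transfer"),
  (32614, "Transfer"),
  (32614, "e-Transfer"),
  (32614, "IMT"),
  (33987, "Transfer"),
  (34193, "e-Transfer")
).toDF("ID", "Category")

// Collect the distinct categories of each ID into an array,
// then derive the final label from the contents of that array.
df.groupBy("ID").agg(collect_set("Category").as("CategorySet")).
  withColumn("Category",
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
      "e-Transfer + IMT").otherwise(
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
      "e-Transfer").otherwise(
    // A lone subcategory keeps its own name.
    when($"CategorySet" === Array("e-Transfer") || $"CategorySet" === Array("IMT"),
      $"CategorySet"(0)).otherwise(
    // Only the parent category is present; any unmatched case yields null.
    when($"CategorySet" === Array("Transfer"), "Other")
    )))
  ).
  show(false)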
// +-----+---------------------------+----------------+
// |ID   |CategorySet                |Category        |
// +-----+---------------------------+----------------+
// |33987|[Transfer]                 |Other           |
// |32614|[Transfer, e-Transfer, IMT]|e-Transfer + IMT|
// |34193|[e-Transfer]               |e-Transfer      |
// |31898|[Transfer, e-Transfer]     |e-Transfer      |
// +-----+---------------------------+----------------+
Your sample data might not cover all cases (e.g. [Transfer, IMT]). The sample code above produces a null Category value for any case that none of the conditions match. Modify or extend the conditional checks as additional cases are identified.
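One way to reason about the remaining cases is to write the rule as a plain Scala function first and check it against every combination before translating it into column expressions. The sketch below is not part of the original answer: the object name CategoryLabel and the function labelFor are hypothetical, and it assumes that any subcategory present overrides the parent Transfer (so [Transfer, IMT] would become IMT), which the question does not state explicitly.

```scala
// Hypothetical helper, not from the original answer.
object CategoryLabel {
  // Subcategories, in the order they should appear in a combined label.
  private val subcategories = Seq("e-Transfer", "IMT")

  // Maps the categories collected for one ID to a single label.
  // Assumption: any subcategory present overrides the parent "Transfer".
  def labelFor(categories: Seq[String]): Option[String] = {
    val present = subcategories.filter(c => categories.contains(c))
    if (present.nonEmpty) Some(present.mkString(" + ")) // one or both subcategories
    else if (categories.contains("Transfer")) Some("Other") // parent category only
    else None // nothing recognised: no label
  }
}
```

Under these assumptions, labelFor(Seq("Transfer", "IMT")) gives Some("IMT") instead of a missing value, and an input with no recognised category gives None. If the rule turns out to be right for your data, it could be applied to the CategorySet column through Spark's udf function, or rewritten as additional when branches.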