Scala: conditionally replacing column values in a DataFrame

Question

DataFrame 1 is what I have now, and I want to write a Scala function to make DataFrame 1 look like DataFrame 2.

Transfer is the big category; e-transfer and IMT are subcategories.

The logic is as follows. For a given ID (31898), if both Transfer and e-Transfer are tagged to it, it should only be e-Transfer; if Transfer, IMT and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; if only e-Transfer or only IMT is tagged to an ID (34193), it should just be e-Transfer or IMT.

New to Scala, don't know how to write a good function to do this. Please help!!

DataFrame 1                        DataFrame 2
+---------+-------------+          +---------+------------------+
|   ID    | Category    |          |   ID    | Category         |
+---------+-------------+          +---------+------------------+  
|  31898  |   Transfer  |          |  31898  |  e-Transfer      |  
|  31898  |  e-Transfer |          |  32614  |  e-Transfer + IMT|
|  32614  |   Transfer  |  =====>  |  33987  |   Other          |
|  32614  |  e-Transfer |  =====>  |  34193  |  e-Transfer      |
|  32614  |     IMT     |          +---------+------------------+
|  33987  |   Transfer  |  
|  34193  |  e-Transfer |  
+---------+-------------+

Recommended answer

You can group the DataFrame by ID to aggregate Category using collect_set to assemble arrays of categories, and create a new column based on content in the category arrays using array_contains:

import org.apache.spark.sql.functions._
// Needed for toDF and the $"col" syntax; spark-shell imports this automatically.
import spark.implicits._

val df = Seq(
  (31898, "Transfer"),
  (31898, "e-Transfer"),
  (32614, "Transfer"),
  (32614, "e-Transfer"),
  (32614, "IMT"),
  (33987, "Transfer"),
  (34193, "e-Transfer")
).toDF("ID", "Category")

// Collect the distinct categories per ID, then derive the final label.
// Conditions are checked in order; unmatched sets yield null.
df.groupBy("ID").agg(collect_set("Category").as("CategorySet")).
  withColumn("Category",
    // both subcategories present
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
      "e-Transfer + IMT").
    // e-Transfer together with the parent category
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
      "e-Transfer").
    // a single subcategory on its own: keep it as is
    when($"CategorySet" === Array("e-Transfer") || $"CategorySet" === Array("IMT"),
      $"CategorySet"(0)).
    // only the parent category
    when($"CategorySet" === Array("Transfer"), "Other")
  ).
  show(false)
// +-----+---------------------------+----------------+
// |ID   |CategorySet                |Category        |
// +-----+---------------------------+----------------+
// |33987|[Transfer]                 |Other           |
// |32614|[Transfer, e-Transfer, IMT]|e-Transfer + IMT|
// |34193|[e-Transfer]               |e-Transfer      |
// |31898|[Transfer, e-Transfer]     |e-Transfer      |
// +-----+---------------------------+----------------+

Your sample data might not cover all cases (e.g. [Transfer, IMT]). The sample code above generates a null Category value for any remaining case. Simply modify or expand the conditional checks if additional cases are identified.
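
As one possible way to extend the checks so that no combination falls through to null, the sketch below builds the label from whichever subcategories are present and falls back to Other. It is not part of the original answer and rests on some assumptions: Transfer is the only parent category, e-Transfer and IMT are the only subcategories, and ID 40001 is a made-up row added to exercise the [Transfer, IMT] case. It relies on concat_ws skipping null inputs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for experimenting; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("category-labels").getOrCreate()
import spark.implicits._

val df = Seq(
  (31898, "Transfer"),
  (31898, "e-Transfer"),
  (32614, "Transfer"),
  (32614, "e-Transfer"),
  (32614, "IMT"),
  (33987, "Transfer"),
  (34193, "e-Transfer"),
  (40001, "Transfer"), // hypothetical ID, not in the question's data
  (40001, "IMT")
).toDF("ID", "Category")

val grouped = df.groupBy("ID").agg(collect_set("Category").as("CategorySet"))

// A `when` without `otherwise` evaluates to null when its condition is false,
// and concat_ws ignores nulls, so only the subcategories present are joined.
val subLabel = concat_ws(" + ",
  when(array_contains($"CategorySet", "e-Transfer"), "e-Transfer"),
  when(array_contains($"CategorySet", "IMT"), "IMT"))

// An empty label means no subcategory was tagged, so fall back to "Other".
val labeled = grouped
  .withColumn("Category", when(subLabel === "", "Other").otherwise(subLabel))
  .select("ID", "Category")

labeled.orderBy("ID").show(false)
```

For the rows in the question this should give the same labels as the code above, while the hypothetical ID 40001 is labeled IMT instead of null. Note that any category outside the three named here is ignored, so an ID tagged only with an unknown category would also end up as Other; whether that is the desired behavior depends on your data.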
