This article walks through a way to flatten a hierarchical (tree-structured) data table with Scala and Spark. It may serve as a useful reference if you are solving a similar problem; follow along below.
Problem Description
I have a data table with a hierarchical data model using tree structures. For example, here are some sample data rows:
-------------------------------------------
Id  | name     | parentId | path       | depth
-------------------------------------------
55  | Canada   | null     | null       | 0
77  | Ontario  | 55       | /55        | 1
100 | Toronto  | 77       | /55/77     | 2
104 | Brampton | 100      | /55/77/100 | 3
I am looking to convert those rows into a flattened version; sample output would be:
-----------------------------------
Id  | name     | parentName | depth
-----------------------------------
104 | Brampton | Toronto    | 3
100 | Toronto  | Ontario    | 2
77  | Ontario  | Canada     | 1
55  | Canada   | None       | 0
100 | Toronto  | Ontario    | 2
77  | Ontario  | Canada     | 1
55  | Canada   | None       | 0
77  | Ontario  | Canada     | 1
55  | Canada   | None       | 0
55  | Canada   | None       | 0
I tried using a cartesian join and an O(n²) search, but neither approach worked.
Recommended Answer
Here is one approach:
import org.apache.spark.sql.functions._ // regexp_replace, split, explode
import spark.implicits._ // $-syntax and toDF; already in scope in spark-shell

// Creating a DF with your data
def getSeq(s: String): Seq[String] = s.split('|').map(_.trim).toSeq
var l = getSeq("77 | Ontario | 55 | /55 | 1") :: Nil
l :+= getSeq("55 | Canada | null | null | 0")
l :+= getSeq("100| Toronto | 77 | /55/77 | 2")
l :+= getSeq("104| Brampton| 100 | /55/77/100 | 3")
val df = l.map { case Seq(a, b, c, d, e) => (a, b, c, d, e) }.toDF("Id", "name", "parentId", "path", "depth")
// Original DF with parentName resolved via a self join; leftouter keeps the
// root row, whose parentId matches no Id
val dfWithPar = df.as("df1")
  .join(df.as("df2"), $"df1.parentId" === $"df2.Id", "leftouter")
  .select($"df1.Id", $"df1.name", $"df1.parentId", $"df1.path", $"df1.depth", $"df2.name".as("parentName"))
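// Illustrative check: each row now carries its parent's name
// (row order on the sample data may vary):
dfWithPar.select($"Id", $"name", $"parentName", $"depth").show
// +---+--------+----------+-----+
// | Id|    name|parentName|depth|
// +---+--------+----------+-----+
// | 77| Ontario|    Canada|    1|
// | 55|  Canada|      null|    0|
// |100| Toronto|   Ontario|    2|
// |104|Brampton|   Toronto|    3|
// +---+--------+----------+-----+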
// Split the path into its ancestor ids and explode, one row per ancestor
val dfExploded = dfWithPar
  .withColumn("path", regexp_replace($"path", "^/", ""))
  .withColumn("path", split($"path", "/"))
  .withColumn("path", explode($"path"))
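// Illustrative check: explode repeats each row once per ancestor id in its
// path. Canada's literal "null" path survives as the string "null" here; the
// inner join in the next step filters it out:
dfExploded.select($"Id", $"name", $"path").show
// +---+--------+----+
// | Id|    name|path|
// +---+--------+----+
// | 77| Ontario|  55|
// | 55|  Canada|null|
// |100| Toronto|  55|
// |100| Toronto|  77|
// |104|Brampton|  55|
// |104|Brampton|  77|
// |104|Brampton| 100|
// +---+--------+----+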
// Join the original DF with the exploded one to build the addendum rows,
// one per ancestor id appearing in a path
val dfJoined = dfWithPar
  .join(dfExploded, dfWithPar.col("Id") === dfExploded.col("path"))
  .select(dfWithPar.col("Id"), dfWithPar.col("name"), dfWithPar.col("parentId"),
    dfWithPar.col("path"), dfWithPar.col("depth"), dfWithPar.col("parentName"))
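// dfJoined contains one copy of an ancestor's full row per descendant that
// lists it in a path: Canada three times (via 77, 100, 104), Ontario twice
// (via 100, 104), Toronto once (via 104). These are exactly the duplicate
// rows the question's expected output asks for.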
//Get the final result by adding the addendum to orig
dfWithPar.union(dfJoined).select($"Id", $"name", $"parentName", $"depth").show
+---+--------+----------+-----+
| Id| name|parentName|depth|
+---+--------+----------+-----+
| 77| Ontario| Canada| 1|
| 55| Canada| null| 0|
|100| Toronto| Ontario| 2|
|104|Brampton| Toronto| 3|
| 77| Ontario| Canada| 1|
| 77| Ontario| Canada| 1|
| 55| Canada| null| 0|
| 55| Canada| null| 0|
| 55| Canada| null| 0|
|100| Toronto| Ontario| 2|
+---+--------+----------+-----+
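The union does not guarantee any particular row ordering, so the rows above may come back in a different order on your machine. If you want a deterministic ordering, say deepest rows first, one option is to sort before showing (the cast is needed because depth was created as a string column):

dfWithPar.union(dfJoined)
  .select($"Id", $"name", $"parentName", $"depth")
  .orderBy($"depth".cast("int").desc)
  .show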
That concludes this article on handling hierarchical data tables in Scala Spark. Hopefully the recommended answer above is helpful to you.