I have an input dataset in the following format:
+---+--------+----------+
| id| refId| timestamp|
+---+--------+----------+
| 1| null|1548944642|
| 1|29950529|1548937685|
| 2|27510720|1548944885|
| 2|27510720|1548943617|
+---+--------+----------+
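I need to add a new column `session` using the following transformation logic:
1. If `refId is null`, the session value is true.
2. If `id and refId are unique`, the session value is true.
3. If `id and refId are not unique` and the `timestamp` is greater than the previous row's, the session value is true. The difference between the timestamps should also be greater than 60.
+---+--------+-------+----------+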
| id| refId|session| timestamp|
+---+--------+-------+----------+
| 1| null| true|1548944642|
| 1|29950529| true|1548937685|
| 2|27510720| false|1548943617|
| 2|27510720| true|1548944885|
+---+--------+-------+----------+
I am able to handle conditions 1 and 3 separately, but not the second one.
1.
data.withColumn("session", functions.when(data.col("refId").isNull(), true));
3.
WindowSpec w = Window.partitionBy("id", "refId").orderBy(timestampDS.col("timestamp"));
functions.coalesce(timestampDS.col("timestamp").cast("long").minus(functions.lag("timestamp", 1).over(w).cast("long")), functions.lit(0));
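My question is how to satisfy the second condition and implement all three transformations together.
Best answer
I would say that using Spark SQL reduces the complexity and gets the result easily: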
df.createOrReplaceTempView("test")
spark.sql("""
  select id, refId, timestamp,
         case when refId is null and id is not null then 'true'
              when id is not null and refId is not null and rank = 1 then 'true'
              else 'false' end as session
  from (select id, refId, timestamp,
               rank() over (partition by id, refId order by timestamp desc) as rank
        from test) c
""").show()
The output looks like this:
+---+--------+----------+-------+
| id| refId| timestamp|session|
+---+--------+----------+-------+
| 1| null|1548944642| true|
| 1|29950529|1548937685| true|
| 2|27510720|1548944885| true|
| 2|27510720|1548943617| false|
+---+--------+----------+-------+
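Note that the query above flags the latest row of each (id, refId) pair and does not itself compare the timestamp gap against 60. For checking the expected flags on a small sample, the three rules from the question can also be written out in plain Java, without Spark. This is only a sketch under stated assumptions: `SessionRules`, `Row` and `flagSessions` are made-up names, and the first row of a duplicated (id, refId) group is treated as false because it has no previous row, which matches the expected output in the question.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SessionRules {

    /** Hypothetical row holder; refId is nullable, timestamp is in epoch seconds. */
    static final class Row {
        final int id;
        final Long refId;
        final long timestamp;
        boolean session;

        Row(int id, Long refId, long timestamp) {
            this.id = id;
            this.refId = refId;
            this.timestamp = timestamp;
        }
    }

    static final long MIN_GAP_SECONDS = 60;

    /** Sets the session flag on every row in place and returns the same list. */
    static List<Row> flagSessions(List<Row> rows) {
        Map<String, List<Row>> groups = new HashMap<>();
        for (Row r : rows) {
            if (r.refId == null) {
                r.session = true; // rule 1: refId is null
            } else {
                // collect the remaining rows per (id, refId) pair
                groups.computeIfAbsent(r.id + "|" + r.refId, k -> new ArrayList<>()).add(r);
            }
        }
        for (List<Row> group : groups.values()) {
            if (group.size() == 1) {
                group.get(0).session = true; // rule 2: (id, refId) is unique
                continue;
            }
            // rule 3: compare each row with the previous one in timestamp order
            group.sort(Comparator.comparingLong(r -> r.timestamp));
            group.get(0).session = false; // assumption: no previous row means false
            for (int i = 1; i < group.size(); i++) {
                long gap = group.get(i).timestamp - group.get(i - 1).timestamp;
                group.get(i).session = gap > MIN_GAP_SECONDS;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1, null, 1548944642L));
        rows.add(new Row(1, 29950529L, 1548937685L));
        rows.add(new Row(2, 27510720L, 1548944885L));
        rows.add(new Row(2, 27510720L, 1548943617L));
        for (Row r : flagSessions(rows)) {
            System.out.println(r.id + " | " + r.refId + " | " + r.session + " | " + r.timestamp);
        }
    }
}
```

In Spark the same idea would presumably map to a count over the (id, refId) window for rule 2 and a `lag` over the same window ordered by timestamp for rule 3, combined in one `when`/`otherwise` chain.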
Regarding "java - Spark: complex operation with dataframes", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54747913/