Problem description
I have the following dataframe showing the revenue of purchases.
+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
|      1|       1|      0|
|      1|       2|      0|
|      1|       3|      0|
|      1|       4|    100|
|      1|       5|      0|
|      1|       6|      0|
|      1|       7|    200|
|      1|       8|      0|
|      1|       9|     10|
+-------+--------+-------+
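For reference, a minimal snippet that reproduces this input, assuming a SparkSession already bound to the name spark (not shown in the original post):

# Assumption: `spark` is an existing SparkSession.
df = spark.createDataFrame(
    [(1, 1, 0), (1, 2, 0), (1, 3, 0), (1, 4, 100),
     (1, 5, 0), (1, 6, 0), (1, 7, 200), (1, 8, 0), (1, 9, 10)],
    ["user_id", "visit_id", "revenue"],
)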
Ultimately I want the new column purch_revenue to show the revenue generated by the purchase in every row. As a workaround, I have also tried to introduce a purchase identifier purch_id, which is incremented each time a purchase is made. This is listed just as a reference.
+-------+--------+-------+-------------+--------+
|user_id|visit_id|revenue|purch_revenue|purch_id|
+-------+--------+-------+-------------+--------+
|      1|       1|      0|          100|       1|
|      1|       2|      0|          100|       1|
|      1|       3|      0|          100|       1|
|      1|       4|    100|          100|       1|
|      1|       5|      0|          200|       2|
|      1|       6|      0|          200|       2|
|      1|       7|    200|          200|       2|
|      1|       8|      0|           10|       3|
|      1|       9|     10|           10|       3|
+-------+--------+-------+-------------+--------+
I've tried to use the lag/lead function like this:
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

user_timeline = Window.partitionBy("user_id").orderBy("visit_id")
find_rev = fn.when(fn.col("revenue") > 0, fn.col("revenue")) \
    .otherwise(fn.lead(fn.col("revenue"), 1).over(user_timeline))
df.withColumn("purch_revenue", find_rev)
This duplicates the revenue column if revenue > 0 and also pulls it up by one row. Clearly, I can chain this for a finite N, but that's not a solution.
- Is there a way to apply this recursively until revenue > 0 is reached?
- Alternatively, is there a way to increment a value based on a condition? I've tried to figure out a way to do this, but have struggled to find one.
Recommended answer
Window functions don't support recursion, but it is not required here. This type of sessionization can be easily handled with a cumulative sum:
from pyspark.sql.functions import col, sum, when, lag
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy("visit_id")

# Flag purchase rows with 1, shift the flag down one row with lag (so a
# purchase row still belongs to the session it closes), then take the
# running total and add 1 to get a session id that starts at 1.
purch_id = sum(lag(when(
    col("revenue") > 0, 1).otherwise(0),
    1, 0
).over(w)).over(w) + 1

df.withColumn("purch_id", purch_id).show()
+-------+--------+-------+--------+
|user_id|visit_id|revenue|purch_id|
+-------+--------+-------+--------+
|      1|       1|      0|       1|
|      1|       2|      0|       1|
|      1|       3|      0|       1|
|      1|       4|    100|       1|
|      1|       5|      0|       2|
|      1|       6|      0|       2|
|      1|       7|    200|       2|
|      1|       8|      0|       3|
|      1|       9|     10|       3|
+-------+--------+-------+--------+
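The answer stops at purch_id, but the purch_revenue column asked for in the question can be derived from it. Below is a minimal sketch, not part of the original answer, assuming each (user_id, purch_id) group contains exactly one purchase row; the window name purch_session is introduced here for illustration:

from pyspark.sql.functions import max as max_

# Assumption: each (user_id, purch_id) group holds exactly one non-zero
# revenue row, so the group maximum equals that purchase's revenue.
purch_session = Window.partitionBy("user_id", "purch_id")

df.withColumn("purch_id", purch_id) \
  .withColumn("purch_revenue", max_("revenue").over(purch_session)) \
  .show()

With the sample data this fills purch_revenue with 100, 200, and 10 for the three sessions, matching the expected output above.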