Problem Description
Below is my spark data frame
a b c
1 3 4
2 0 0
4 1 0
2 2 0
My output should be as below
a b c
1 3 4
2 0 2
4 1 -1
2 2 3
The formula is prev(c) - a + b, where prev(c) is the previous row's updated value of c,
i.e., 4 - 2 + 0 = 2
and 2 - 4 + 1 = -1
Can anyone please help me to cross this hurdle?
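Since each row's result depends on the previous row's *updated* c, this is a recursive computation rather than a plain lag over the original column. A minimal plain-Python sketch of the recurrence (outside Spark) reproduces the two worked examples above; note that applying the same recurrence to the fourth row would give -1 rather than the 3 shown in the expected output, so that last value may be a typo in the question:

```python
# Input rows (a, b, c) from the question
rows = [[1, 3, 4], [2, 0, 0], [4, 1, 0], [2, 2, 0]]

# First row's c is kept as-is; each later row uses the
# previous row's *updated* c: new_c = prev(c) - a + b
out = [rows[0][:]]
for a, b, c in rows[1:]:
    prev_c = out[-1][2]
    out.append([a, b, prev_c - a + b])

print(out)  # second and third rows: 4-2+0 = 2 and 2-4+1 = -1
```

Because the recurrence reads its own previous output, it cannot be expressed as a single `lag` over the input column, which is exactly the hurdle described here.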
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

numbers = [[1,2,3],[2,3,4],[3,4,5],[5,6,7]]
df = sc.parallelize(numbers).toDF(['a','b','c'])
df.show()

# Single window over the whole frame, ordered by 'a'
w = Window().partitionBy().orderBy('a')
# prev(a) - b + c; the first row has no previous value, so its result is null
df = df.withColumn('result', lag('a').over(w) - df.b + df.c)
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 5| 6| 7|
+---+---+---+
+---+---+---+------+
| a| b| c|result|
+---+---+---+------+
| 1| 2| 3| null|
| 2| 3| 4| 2|
| 3| 4| 5| 3|
| 5| 6| 7| 4|
+---+---+---+------+
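As the output shows, the lag-based column is not recursive: each result combines the *previous row's original* a with the current row's b and c. A plain-Python sketch of the same expression reproduces the result column:

```python
# Same data as the Spark example above
numbers = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]]

# lag('a') has no value for the first row, hence None (null in Spark)
results = [None]
for prev, cur in zip(numbers, numbers[1:]):
    results.append(prev[0] - cur[1] + cur[2])  # prev(a) - b + c

print(results)  # [None, 2, 3, 4], matching the 'result' column
```

This is why the lag approach works for the answer's data but does not, by itself, implement the asker's recurrence on the updated c column.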
This concludes this article on dynamic column calculation in PySpark. We hope the answer above is helpful, and thank you for your continued support!