Problem Description
Below is my spark data frame
a b c
1 3 4
2 0 0
4 1 0
2 2 0
My output should be as below
a b c
1 3 4
2 0 2
4 1 -1
2 2 3
The formula is prev(c) - a + b, where prev(c) is the previous row's updated value of c,
i.e., 4 - 2 + 0 = 2
and 2 - 4 + 1 = -1
Can anyone please help me to cross this hurdle?
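Since each row's result depends on the previous row's *updated* c, this is a recursive computation rather than a plain lag over the original column. A minimal plain-Python sketch of the recurrence (outside Spark) reproduces the two worked examples above; note that applying the same recurrence to the fourth row would give -1 rather than the 3 shown in the expected output, so that last value may be a typo in the question:

```python
# Input rows (a, b, c) from the question
rows = [[1, 3, 4], [2, 0, 0], [4, 1, 0], [2, 2, 0]]

# First row's c is kept as-is; each later row uses the
# previous row's *updated* c: new_c = prev(c) - a + b
out = [rows[0][:]]
for a, b, c in rows[1:]:
    prev_c = out[-1][2]
    out.append([a, b, prev_c - a + b])

print(out)  # second and third rows: 4-2+0 = 2 and 2-4+1 = -1
```

Because the recurrence reads its own previous output, it cannot be expressed as a single `lag` over the input column, which is exactly the hurdle described here.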
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

numbers = [[1,2,3],[2,3,4],[3,4,5],[5,6,7]]
df = sc.parallelize(numbers).toDF(['a','b','c'])
df.show()

# Single window over the whole frame, ordered by 'a'
w = Window().partitionBy().orderBy('a')
# prev(a) - b + c; the first row has no previous value, so its result is null
df = df.withColumn('result', lag('a').over(w) - df.b + df.c)
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 5| 6| 7|
+---+---+---+
+---+---+---+------+
| a| b| c|result|
+---+---+---+------+
| 1| 2| 3| null|
| 2| 3| 4| 2|
| 3| 4| 5| 3|
| 5| 6| 7| 4|
+---+---+---+------+
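As the output shows, the lag-based column is not recursive: each result combines the *previous row's original* a with the current row's b and c. A plain-Python sketch of the same expression reproduces the result column:

```python
# Same data as the Spark example above
numbers = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]]

# lag('a') has no value for the first row, hence None (null in Spark)
results = [None]
for prev, cur in zip(numbers, numbers[1:]):
    results.append(prev[0] - cur[1] + cur[2])  # prev(a) - b + c

print(results)  # [None, 2, 3, 4], matching the 'result' column
```

This is why the lag approach works for the answer's data but does not, by itself, implement the asker's recurrence on the updated c column.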
This concludes this article on dynamic column calculation in PySpark. We hope the answer above is helpful, and thank you for your continued support!