This post explains how to compute a cumulative product over a pySpark DataFrame column.
Problem description
I have the following Spark DataFrame:
+---+---+
| a| b|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
+---+---+
I want to add another column named "c" which contains the cumulative product of "b" over "a". The resulting DataFrame should look like:
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 1| 1|
| 1| 2| 2|
| 1| 3| 6|
| 1| 4| 24|
+---+---+---+
How can this be done?
Recommended answer
You have to set an order column. In your case I used column 'b'.
from pyspark.sql import functions as F, Window, types
from functools import reduce
from operator import mul

df = spark.createDataFrame([(1, 1), (1, 2), (1, 3), (1, 4), (1, 5)], ['a', 'b'])

order_column = 'b'
window = Window.orderBy(order_column)

# value to accumulate; since 'a' is 1 for every row here, a * b is just b
expr = F.col('a') * F.col('b')

# UDF that multiplies together all values collected so far
mul_udf = F.udf(lambda x: reduce(mul, x), types.IntegerType())

# collect_list over the ordered window gathers the running prefix of values,
# and the UDF reduces that prefix to its product
df = df.withColumn('c', mul_udf(F.collect_list(expr).over(window)))
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 1| 1|
| 1| 2| 2|
| 1| 3| 6|
| 1| 4| 24|
| 1| 5|120|
+---+---+---+
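If you are on Spark 3.1 or newer, here is a minimal sketch that avoids the Python UDF, assuming F.product (added as an aggregate function in Spark 3.1) works over a window, and that "over 'a'" means the running product should restart for each value of 'a':

from pyspark.sql import functions as F, Window

df = spark.createDataFrame([(1, 1), (1, 2), (1, 3), (1, 4), (1, 5)], ['a', 'b'])

# partitionBy('a') restarts the running product for each value of 'a';
# orderBy('b') defines the accumulation order within each group
window = Window.partitionBy('a').orderBy('b')

# product() returns a double, so cast back to long for integer output
df = df.withColumn('c', F.product('b').over(window).cast('long'))
df.show()

This produces the same output as above, and keeping the computation in native Spark functions avoids the serialization overhead of a Python UDF, which matters on larger DataFrames.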