Given the following example DataFrame:
advertiser_id | name    | amount      | total               | max_total_advertiser
4061          | source1 | -434.955284 | -354882.75336200005 | -355938.53950700007
4061          | source2 | -594.012216 | -355476.76557800005 | -355938.53950700007
4061          | source3 | -461.773929 | -355938.53950700007 | -355938.53950700007
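I need to sum the amount and max_total_advertiser fields so that every row holds the correct total value. I need this total for each group partitioned by advertiser_id. (The total column in the initial DataFrame is incorrect, which is why I want to compute it correctly.)
It should be something like this: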
w = Window.partitionBy("advertiser_id").orderBy("advertiser_id")
df.withColumn(
    "total_aux",
    when(
        lag("advertiser_id").over(w) == col("advertiser_id"),
        lag("total_aux").over(w) + col("amount"),
    ).otherwise(col("max_total_advertiser") + col("amount")),
)
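The lag("total_aux") call does not work because that column has not been generated yet. What I want to achieve is this: if a row is the first in its group, add the columns in that same row (max_total_advertiser + amount); otherwise, add the current amount to the previously obtained value. Example output: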
advertiser_id | name    | amount      | total_aux
4061          | source1 | -434.955284 | -356373.494791
4061          | source2 | -594.012216 | -356967.507007
4061          | source3 | -461.773929 | -357429.280936
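The rule described above (seed the first row of a group with max_total_advertiser + amount, then keep adding amount to the previous result) is a running sum, so each row's value should equal max_total_advertiser plus the cumulative amount up to that row. A minimal plain-Python sketch of that idea, outside Spark, assuming rows are ordered by name within each advertiser and that max_total_advertiser is constant per advertiser. The function name running_totals is hypothetical and only serves the illustration:

```python
from itertools import accumulate, groupby
from operator import itemgetter

# (advertiser_id, name, amount, max_total_advertiser), values from the question
rows = [
    (4061, "source1", -434.955284, -355938.53950700007),
    (4061, "source2", -594.012216, -355938.53950700007),
    (4061, "source3", -461.773929, -355938.53950700007),
]


def running_totals(rows):
    """Return (advertiser_id, name, amount, total_aux) tuples.

    total_aux = max_total_advertiser + cumulative sum of amount within the group.
    """
    result = []
    # Sort by advertiser_id, then name (assumed ordering within a group).
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        cumulative = accumulate(amount for _, _, amount, _ in group)
        for (adv, name, amount, base), running in zip(group, cumulative):
            result.append((adv, name, amount, base + running))
    return result


for row in running_totals(rows):
    print(row)
```

Up to floating-point rounding, the last element of each printed tuple should match the total_aux values in the expected output above.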
Thanks.
Best answer
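I assume that name is a distinct value within each advertiser_id, so your dataset can be ordered by name. I also assume that max_total_advertiser contains the same value for every row of a given advertiser_id. If either assumption does not hold, please add a comment.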
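What you need is a window with a rangeBetween frame, which gives you all preceding and following rows within the specified range. We will use a frame running from Window.unboundedPreceding to 0 (the current row) to sum all previous values together with the current one.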
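import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

# Reuse the active session (the pyspark shell already provides `spark`).
spark = SparkSession.builder.getOrCreate()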
l = [
(4061, 'source1',-434.955284,-354882.75336200005, -355938.53950700007),
(4061, 'source2',-594.012216,-355476.76557800005, -345938.53950700007),
(4062, 'source1',-594.012216,-355476.76557800005, -5938.53950700007),
(4062, 'source2',-594.012216,-355476.76557800005, -5938.53950700007),
(4061, 'source3',-461.773929,-355938.53950700007, -355938.53950700007)
]
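columns = ['advertiser_id', 'name', 'amount', 'total', 'max_total_advertiser']
df = spark.createDataFrame(l, columns)

# Frame: every row from the start of the partition up to the current row, ordered by name.
w = Window.partitionBy('advertiser_id').orderBy('name').rangeBetween(Window.unboundedPreceding, 0)

# Overwrite the incorrect total with the running sum of amount plus max_total_advertiser.
df = df.withColumn('total', F.sum('amount').over(w) + df.max_total_advertiser)
df.show()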
Output:
+-------------+-------+-----------+-------------------+--------------------+
|advertiser_id| name| amount| total|max_total_advertiser|
+-------------+-------+-----------+-------------------+--------------------+
| 4062|source1|-594.012216|-6532.5517230000705| -5938.53950700007|
| 4062|source2|-594.012216| -7126.563939000071| -5938.53950700007|
| 4061|source1|-434.955284| -356373.4947910001| -355938.53950700007|
| 4061|source2|-594.012216| -346967.5070070001| -345938.53950700007|
| 4061|source3|-461.773929|-357429.28093600005| -355938.53950700007|
+-------------+-------+-----------+-------------------+--------------------+
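The assumption that name is distinct within each advertiser_id matters because a range frame ending at the current row also includes its peers, that is, rows with the same ordering value. The following plain-Python sketch emulates the two frame types to show the difference. The duplicated name and the amounts are hypothetical data, and both helper functions are illustrative, not part of PySpark:

```python
def range_running_sum(keys, values):
    """RANGE frame: sum every row whose sort key is <= the current key (peers included)."""
    return [sum(v for k2, v in zip(keys, values) if k2 <= k) for k in keys]


def rows_running_sum(values):
    """ROWS frame: sum rows by position, up to and including the current row."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out


names = ["source1", "source2", "source2", "source3"]  # hypothetical duplicate name
amounts = [1, 2, 3, 4]                                # hypothetical amounts

print(range_running_sum(names, amounts))  # [1, 6, 6, 10]
print(rows_running_sum(amounts))          # [1, 3, 6, 10]
```

With duplicate names, both tied rows receive the same range-based total. If that is not what you want, a row-based frame such as rowsBetween(Window.unboundedPreceding, Window.currentRow) is one alternative to consider, although the order among tied rows is then not deterministic unless the ordering is made unique.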
Regarding "python - PySpark sum of two values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56911888/