Question
I have the same problem as asked here, but I need a solution in pyspark and without breeze.
For example, if my pyspark dataframe looks like this:
user | weight | vec
"u1" | 0.1 | [2, 4, 6]
"u1" | 0.5 | [4, 8, 12]
"u2" | 0.5 | [20, 40, 60]
where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this:
user | wsum
"u1" | [2.2, 4.4, 6.6]
"u2" | [10, 20, 30]
To do this I have tried the following:
df.groupBy('user').agg((F.sum(df.vec * df.weight)).alias("wsum"))
But it failed because the vec and weight columns have different types.
How can I solve this error without breeze?
Answer
One way is to use the higher-order function transform, available from Spark 2.4:
from pyspark.sql.functions import array, col, expr, size, sum

# get the size of the vec array
n = df.select(size("vec")).first()[0]

# multiply each element of the vec array by the row's weight
transform_expr = "transform(vec, x -> x * weight)"

df.withColumn("weighted_vec", expr(transform_expr)) \
  .groupBy("user") \
  .agg(array(*[sum(col("weighted_vec")[i]) for i in range(n)]).alias("wsum")) \
  .show()
Gives:
+----+------------------+
|user| wsum|
+----+------------------+
| u1| [2.2, 4.4, 6.6]|
| u2|[10.0, 20.0, 30.0]|
+----+------------------+
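As a side note, on Spark 3.1 or later the same higher-order function is exposed directly in pyspark.sql.functions, so the expr() string can be replaced with a lambda. A sketch, assuming the same df and n as above:
from pyspark.sql.functions import array, col, sum, transform

# Spark 3.1+ only: transform() takes a lambda over each array element,
# which can also reference other columns of the row such as weight.
df.withColumn("weighted_vec", transform("vec", lambda x: x * col("weight"))) \
  .groupBy("user") \
  .agg(array(*[sum(col("weighted_vec")[i]) for i in range(n)]).alias("wsum")) \
  .show()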
For Spark < 2.4, use a comprehension to multiply each element by the weight column, like this:
df.withColumn("weighted_vec", array(*[col("vec")[i] * col("weight") for i in range(n)])) \
.groupBy("user").agg(array(*[sum(col("weighted_vec")[i]) for i in range(n)]).alias("wsum")) \
.show()
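Both variants assume every vec array has the same length n (computed above) and should produce the same wsum output shown earlier.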