Pyspark - Calculating RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column
Problem Description
I have a function that calculates RMSE for the preds and actuals of an entire dataframe:
import pyspark.sql.functions as F

def calculate_rmse(df, actual_column, prediction_column):
    # Per-row squared error via a Python UDF
    RMSE = F.udf(lambda x, y: (x - y) ** 2)
    df = df.withColumn(
        "RMSE", RMSE(F.col(actual_column), F.col(prediction_column))
    )
    # Square root of the mean squared error; the generated
    # column name is "POWER(avg(RMSE), 0.5)"
    rmse = df.select(F.avg("RMSE") ** 0.5).collect()
    rmse = rmse[0]["POWER(avg(RMSE), 0.5)"]
    return rmse
test = calculate_rmse(my_df, 'actuals', 'preds')
3690.4535
I would like to apply this to a groupby statement, but when I do, I get the following:
df_gb = my_df.groupby('start_month', 'start_week').agg(calculate_rmse(my_df, 'actuals', 'preds'))
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/group.py", line 113, in agg
    assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
AssertionError: all exprs should be Column
Could someone point me in the correct direction? I am fairly new to Pyspark.
Answer
If you want to calculate RMSE by group, a slight adaptation of the solution I proposed to your earlier question should work:
import pyspark.sql.functions as psf

def compute_RMSE(df, expected_col, actual_col):
    # Squared error per row, then the mean and square root per group
    rmse = (
        df.withColumn(
            "squarederror",
            psf.pow(psf.col(actual_col) - psf.col(expected_col), psf.lit(2)),
        )
        .groupby("start_month", "start_week")
        .agg(psf.avg(psf.col("squarederror")).alias("mse"))
        .withColumn("rmse", psf.sqrt(psf.col("mse")))
    )
    return rmse

df_gb = compute_RMSE(my_df, "preds", "actuals")