This article describes how to handle "Pyspark - Calculating RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column". It should be a useful reference for anyone running into the same problem.

Problem Description

I have a function that calculates RMSE for the preds and actuals of an entire dataframe:

from pyspark.sql import functions as F

def calculate_rmse(df, actual_column, prediction_column):
    # per-row squared error via a Python UDF
    RMSE = F.udf(lambda x, y: ((x - y) ** 2))
    df = df.withColumn(
        "RMSE", RMSE(F.col(actual_column), F.col(prediction_column))
    )
    # RMSE = square root of the mean squared error
    rmse = df.select(F.avg("RMSE") ** 0.5).collect()
    rmse = rmse[0]["POWER(avg(RMSE), 0.5)"]
    return rmse

test = calculate_rmse(my_df, 'actuals', 'preds')

3690.4535
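For reference, the same whole-dataframe RMSE can also be computed without a Python UDF, using only built-in column functions. This is a sketch against the question's my_df with its actuals and preds columns, not part of the original post:

import pyspark.sql.functions as F

# squared error as a native Column expression, then sqrt(avg(...)) in a single aggregation
rmse = my_df.select(
    F.sqrt(F.avg(F.pow(F.col("actuals") - F.col("preds"), F.lit(2)))).alias("rmse")
).collect()[0]["rmse"]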

I would like to apply this to a groupby statement, but when I do, I get the following:

df_gb = my_df.groupby('start_month', 'start_week').agg(calculate_rmse(my_df, 'actuals', 'preds'))


all exprs should be Column
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/group.py", line 113, in agg
    assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
AssertionError: all exprs should be Column
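The assertion fires because calculate_rmse is an ordinary Python function that returns a float, while agg only accepts Column expressions that Spark can evaluate per group. A minimal illustration of the difference, reusing the question's column names:

import pyspark.sql.functions as F

# accepted: a Column expression, evaluated by Spark for each group
my_df.groupby("start_month", "start_week").agg(F.avg("actuals").alias("avg_actuals"))

# rejected: a plain Python value computed eagerly on the driver
my_df.groupby("start_month", "start_week").agg(calculate_rmse(my_df, "actuals", "preds"))
# AssertionError: all exprs should be Column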

Could someone point me in the correct direction? I am fairly new to Pyspark.

Solution

If you want to calculate the RMSE by group, a slight adaptation of the solution I proposed to your previous question will do:

import pyspark.sql.functions as psf

def compute_RMSE(expected_col, actual_col):
    # squared error per row, then the mean per group, then the square root
    rmse = (
        old_df
        .withColumn("squarederror",
                    psf.pow(psf.col(actual_col) - psf.col(expected_col),
                            psf.lit(2)))
        .groupby('start_month', 'start_week')
        .agg(psf.avg(psf.col("squarederror")).alias("mse"))
        .withColumn("rmse", psf.sqrt(psf.col("mse")))
    )
    return rmse


compute_RMSE("col1", "col2")
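Equivalently, the grouped RMSE can be folded into a single agg expression. The sketch below uses the question's my_df and column names rather than old_df, col1 and col2:

# same result in one aggregation: sqrt of the per-group mean squared error
df_gb = (
    my_df
    .groupby("start_month", "start_week")
    .agg(psf.sqrt(psf.avg(psf.pow(psf.col("preds") - psf.col("actuals"), psf.lit(2)))).alias("rmse"))
)
df_gb.show()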

This concludes the article on "Pyspark - Calculating RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column". We hope the recommended answer is helpful.
