Question
I'm trying to compute the largest value of the following DataFrame in Spark 1.6.1:
val df = sc.parallelize(Seq(1,2,3)).toDF("id")
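For reference, the snippets in this question assume the usual Spark 1.6 shell setup (sc and a SQLContext named sqlContext already in scope); a minimal sketch of the implied imports:

import sqlContext.implicits._              // enables toDF and the $"..." column syntax
import org.apache.spark.sql.functions.max  // the max aggregate used throughout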
A first approach is to select the maximum value, and it works as expected:
df.select(max($"id")).show
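On the example data this should print a single-row result, roughly:

+-------+
|max(id)|
+-------+
|      3|
+-------+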
A second approach could be to use withColumn as follows:
df.withColumn("max", max($"id")).show
But unfortunately it fails with the following error message:
org.apache.spark.sql.AnalysisException: expression 'id' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
How can I compute the maximum value in a withColumn function without any Window or groupBy? If that's not possible, how can I do it in this specific case using a Window?
Recommended Answer
The right approach is to compute the aggregate as a separate query and combine it with the actual result. Unlike the window functions suggested in many answers here, it doesn't require a shuffle to a single partition and remains applicable to large datasets.
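For contrast, the window-based workaround this answer advises against would look roughly like the sketch below; an empty partitionBy forces Spark to move every row into one partition before computing the max:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// Unpartitioned window: Spark will warn that it is moving all data to a
// single partition, which is exactly the shuffle the answer wants to avoid.
df.withColumn("max", max($"id").over(Window.partitionBy()))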
It can be done with withColumn, using a separate action to fetch the maximum first:
import org.apache.spark.sql.functions.{lit, max}
df.withColumn("max", lit(df.agg(max($"id")).as[Int].first))
But it is much cleaner to use an explicit cross join:
import org.apache.spark.sql.functions.broadcast
df.crossJoin(broadcast(df.agg(max($"id") as "max")))
or an implicit cross join:
spark.conf.set("spark.sql.crossJoin.enabled", true)
df.join(broadcast(df.agg(max($"id") as "max")))
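Note that crossJoin and spark.conf are Spark 2.x APIs. Since the question targets Spark 1.6.1, a rough equivalent (a sketch, not a verified 1.6 snippet) is the no-condition Cartesian join, which 1.6 permits without any configuration flag:

import org.apache.spark.sql.functions.{broadcast, max}

// In Spark 1.6, df.join(right) with no join expression is a Cartesian join;
// broadcasting the single-row aggregate keeps it cheap.
df.join(broadcast(df.agg(max($"id") as "max")))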