Problem description
I have a Spark dataframe with several columns. I want to add a column to the dataframe that is the sum of a chosen subset of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 1
I want to add a column that holds the row-wise sum of specific columns:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 1 18
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? This answer is basically what I want, but it uses the Python API instead of Scala (Add column sum as new column in PySpark dataframe). I think something like this would work:
// Select columns to sum
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5")
// Create new column called sumofcolumns which is the sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columnstosum.head, columnstosum.tail: _*).sum)
This throws the error "value sum is not a member of org.apache.spark.sql.DataFrame". Is there a way to sum across columns?
Thanks in advance for your help.
Recommended answer
You should try the following:
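import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
// Brings .toDF into scope for RDDs of tuples
import sqlContext.implicits._
// Sample data from the question
val input = sc.parallelize(Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6, 9),
  ("d", 1, 2, 6, 8, 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
// The columns to add up, as Column objects
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
// reduce chains the columns with + into a single expression: var1 + var2 + ... + var5
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()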
The result is then:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
|  a|   5|   7|   9|  12|  13|  46|
|  b|   6|   4|   3|  20|  17|  50|
|  c|   4|   9|   4|   6|   9|  32|
|  d|   1|   2|   6|   8|   1|  18|
+---+----+----+----+----+----+----+
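Since the question asks about passing a list of column names rather than Column objects, here is a minimal sketch of the same reduce approach starting from plain strings. The names RowSum and withRowSum are hypothetical, made up for this illustration, and the sketch assumes Spark 2.x or later, where SparkSession replaces the SQLContext used above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RowSum {
  // Hypothetical helper (not part of Spark): appends `target` as the
  // row-wise sum of the columns named in `columnNames`.
  def withRowSum(df: DataFrame, columnNames: Seq[String], target: String): DataFrame = {
    require(columnNames.nonEmpty, "need at least one column to sum")
    // Turn each name into a Column, then chain them with +.
    val total = columnNames.map(name => col(name)).reduce(_ + _)
    df.withColumn(target, total)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-sum-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val input = Seq(
      ("a", 5, 7, 9, 12, 13),
      ("b", 6, 4, 3, 20, 17),
      ("c", 4, 9, 4, 6, 9),
      ("d", 1, 2, 6, 8, 1)
    ).toDF("ID", "var1", "var2", "var3", "var4", "var5")

    val columnNames = Seq("var1", "var2", "var3", "var4", "var5")
    withRowSum(input, columnNames, "sums").show()

    spark.stop()
  }
}
```

One thing to keep in mind: + on columns yields null for a row if any of the summed values is null, so data with missing values may need each column wrapped in something like coalesce(col(name), lit(0)) first.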