scala - How to change column types in Spark SQL's DataFrame?

Suppose I'm doing something like:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()

root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)

df.show()
year make  model comment              blank
2012 Tesla S     No comment
1997 Ford  E350  Go get one now th...

But I really want year as an Int (and perhaps to transform some other columns).

The best I could come up with is

df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]

which is a bit convoluted.

I'm coming from R, and I'm used to being able to write, e.g.

df2 <- df %>%
   mutate(year = year %>% as.integer,
          make = make %>% toupper)

I'm probably missing something, since there should be a better way to do this in Spark/Scala...

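Best answer

Edit: Newest version

Since Spark 2.x you can use .withColumn, which replaces the existing column when the new column has the same name. Check the docs here: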

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
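
As a rough sketch of the Spark 2.x approach, the snippet below casts year in place and upper-cases make, mirroring the dplyr mutate from the question. It assumes Spark 2.x or later running locally; the in-memory rows are only a stand-in for cars.csv and carry just three of its columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.IntegerType

// In spark-shell a session already exists and getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("cast-column-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Assumption: small in-memory stand-in for cars.csv, all columns read as strings.
val df = Seq(
  ("2012", "Tesla", "S"),
  ("1997", "Ford", "E350")
).toDF("year", "make", "model")

// Reusing an existing column name replaces that column and keeps its position.
val df2 = df
  .withColumn("year", col("year").cast(IntegerType))
  .withColumn("make", upper(col("make")))

df2.printSchema()
```

Because each withColumn call returns a new DataFrame, the calls can be chained much like the arguments to mutate, and no temporary column or rename step is needed.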

Oldest answer

Since Spark version 1.4 you can apply the cast method with a DataType on the column:

import org.apache.spark.sql.types.IntegerType
// Cast into a temporary column, drop the original, then rename the temporary column back
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
    .drop("year")
    .withColumnRenamed("yearTmp", "year")

If you are using SQL expressions you can also do:

val df2 = df.selectExpr("cast(year as int) year",
                        "make",
                        "model",
                        "comment",
                        "blank")

For more info check the docs:
http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
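
The question also mentions converting several columns. One possible way to do that, sketched here under the assumption of Spark 2.x or later, is to fold a map of column names to target types over the DataFrame. The helper castColumns and the price column are hypothetical names introduced only for illustration; they are not part of Spark or of the original data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType}

val spark = SparkSession.builder()
  .appName("cast-many-columns-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: cast every listed column to its target type,
// leaving all other columns untouched.
def castColumns(df: DataFrame, targets: Map[String, DataType]): DataFrame =
  targets.foldLeft(df) { case (acc, (name, dataType)) =>
    acc.withColumn(name, col(name).cast(dataType))
  }

// Assumption: illustrative rows; "price" does not exist in the original cars.csv.
val cars = Seq(
  ("2012", "Tesla", "79900.5"),
  ("1997", "Ford", "3000")
).toDF("year", "make", "price")

val typed = castColumns(cars, Map("year" -> IntegerType, "price" -> DoubleType))
typed.printSchema()
```

Keeping the name-to-type mapping in one Map makes it easy to see which columns are converted, at the cost of one withColumn call per entry.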

Regarding "scala - How to change column types in Spark SQL's DataFrame?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29383107/