Problem description
Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
|-- year: string (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)
df.show()
year make model comment blank
2012 Tesla S No comment
1997 Ford E350 Go get one now th...
But I really wanted the year as Int (and perhaps transform some other columns).
The best I could come up with is:
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
That's a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.
df2 <- df %>%
mutate(year = year %>% as.integer,
make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in Spark/Scala...
Recommended answer
Newest version
Since Spark 2.x you can use .withColumn directly; see the sketch below and the Dataset docs for details.
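A minimal sketch, assuming Spark 2.x behavior where withColumn can overwrite an existing column in place:
import org.apache.spark.sql.types.IntegerType
// Overwrite "year" directly with its integer-cast version; no temporary column needed.
val df2 = df.withColumn("year", df("year").cast(IntegerType))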
Since Spark version 1.4 you can apply the cast method with DataType on the column:
import org.apache.spark.sql.types.IntegerType
// Cast into a temporary column, drop the original, then rename it back.
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
  .drop("year")
  .withColumnRenamed("yearTmp", "year")
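If several columns need new types, one hedged sketch (the casts map below is illustrative, not part of the original answer) is to fold a name-to-type map over the DataFrame:
import org.apache.spark.sql.types.{DataType, IntegerType}
// Hypothetical map of column name -> target type; extend with more entries as needed.
val casts: Map[String, DataType] = Map("year" -> IntegerType)
// Apply each cast in turn, overwriting the column in place (Spark 2.x behavior).
val dfCast = casts.foldLeft(df) { case (acc, (name, dt)) =>
  acc.withColumn(name, acc(name).cast(dt))
}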
If you are using SQL expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year",
  "make",
  "model",
  "comment",
  "blank")
For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame