Problem Description
I have a DataFrame with 66 columns to process (almost every column value needs to be changed in some way), so I'm running the following statement:
val result = data.map(row => (
  modify(row.getString(row.fieldIndex("XX"))),
  (...)
))
and so on up to the 66th column. Since Scala in this version limits tuples to a maximum of 22 elements, I cannot do it this way. Is there any workaround? After all the row operations I convert the result to a DataFrame with specific column names:
result.toDF("c1", ..., "c66")
result.registerTempTable("someFancyResult")
The modify function is just an example to illustrate my point.
Recommended Answer
If all you do is modify values in an existing DataFrame, it is better to use a UDF instead of mapping over an RDD:
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._ // for the $"colName" syntax

// assumes modify is a function value such as String => String
val modifyUdf = udf(modify)
data.withColumn("c1", modifyUdf($"c1"))
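Because the question involves 66 columns, repeating withColumn by hand is tedious; a minimal sketch that folds the same UDF over every column, assuming each column should receive the same transformation:

// Apply modifyUdf to each column in turn; every withColumn call
// replaces that column with its transformed value.
val transformed = data.columns.foldLeft(data) { (df, name) =>
  df.withColumn(name, modifyUdf(df(name)))
}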
If for some reason the above doesn't fit your needs, the simplest thing you can do is to recreate the DataFrame from an RDD[Row], for example like this:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val result: RDD[Row] = data.map(row => {
  val buffer = ArrayBuffer.empty[Any]
  // Add each transformed value to the buffer
  buffer.append(modify(row.getAs[String]("c1")))
  // ... repeat for the other columns
  // Build the row from the accumulated values
  Row.fromSeq(buffer)
})
// Create schema
val schema = StructType(Seq(
StructField("c1", StringType, false),
// ...
StructField("c66", StringType, false)
))
sqlContext.createDataFrame(result, schema)
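Writing out all 66 StructFields by hand is error-prone; assuming the output columns are named c1 through c66 and are all strings, the schema can also be generated programmatically, for example:

// Generate StructFields for c1..c66 instead of listing them manually
val schema = StructType(
  (1 to 66).map(i => StructField(s"c$i", StringType, nullable = false))
)
sqlContext.createDataFrame(result, schema)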