Problem Description
I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When the OneHotEncoder is applied, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way I can get around this?
I can reproduce the error with the example provided on the Spark ml page:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""), //<- original example has "a" here
(4, "a"),
(5, "c")
)).toDF("id", "category")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.show()
It is annoying since missing/empty values are a highly generic case.
Thanks in advance, Nikhil
Recommended Answer
The OneHotEncoder/OneHotEncoderEstimator does not accept an empty string as a name; otherwise you'll get the following error:
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
  at scala.Predef$.require(Predef.scala:233)
  at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
  at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
  [...]
This is how I would do it (there are other ways to do it, cf. @Anthony's answer):
I'll create a UDF to process the empty category:
import org.apache.spark.sql.functions._

// Replace empty category strings with the placeholder "NA"
def processMissingCategory = udf[String, String] { s => if (s == "") "NA" else s }
Then I'll apply the UDF on the column:
val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""), //<- original example has "a" here
(4, "a"),
(5, "c")
)).toDF("id", "category")
  .withColumn("category", processMissingCategory('category))
df.show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
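As a side note, the same replacement can also be written without a UDF, using the built-in when/otherwise column expressions. A minimal sketch, where rawDf is a hypothetical name for the DataFrame as originally loaded (i.e. before the UDF version above was applied):

import org.apache.spark.sql.functions.{col, when}

// Replace empty category strings with "NA" using built-in expressions instead of a UDF
val dfAlt = rawDf.withColumn(
  "category",
  when(col("category") === "", "NA").otherwise(col("category"))
)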
Now you can go back to your transformations:
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| NA| 3.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
// Spark < 2.3
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// Spark 2.3+
import org.apache.spark.ml.feature.OneHotEncoderEstimator
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("categoryVec"))
val encoded = encoder.transform(indexed)
encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex| categoryVec|
// +---+--------+-------------+-------------+
// | 0| a| 0.0|(3,[0],[1.0])|
// | 1| b| 2.0|(3,[2],[1.0])|
// | 2| c| 1.0|(3,[1],[1.0])|
// | 3| NA| 3.0| (3,[],[])|
// | 4| a| 0.0|(3,[0],[1.0])|
// | 5| c| 1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
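Note that the NA row ends up as the all-zero vector (3,[],[]) because the encoder drops the last category by default (dropLast). If you prefer an explicit slot for the missing-value category instead, the flag can be turned off; a minimal sketch:

// Keep a dedicated vector position for the last category (here the NA index 3.0)
val encoderKeepAll = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
  .setDropLast(false)

In Spark 3.0 and later, OneHotEncoderEstimator was renamed back to OneHotEncoder and keeps the same multi-column setInputCols/setOutputCols API.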
@Anthony's solution in Scala:
df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
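na.replace only handles the empty strings; if the CSV import also leaves actual null values in the column, they can be filled in the same spirit with na.fill. A small sketch, assuming the same column name:

// Replace nulls (as opposed to empty strings) in the category column with "NA"
val dfNoNulls = df.na.fill("NA", Seq("category"))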
I hope this helps!