Problem Description
How do I handle categorical data with spark-ml and not spark-mllib?
Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
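For concreteness, this is roughly what those parameters look like on a classifier (a minimal sketch; "features" and "label" are the default values of these parameters):

import org.apache.spark.ml.classification.RandomForestClassifier

// featuresCol must point at a single Vector column; labelCol at a numeric column.
val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")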
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.
However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my feature vector.
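For reference, an assembly like the following works with numeric inputs, while adding a raw string column to inputCols fails at transform time (a minimal sketch with hypothetical column names):

import org.apache.spark.ml.feature.VectorAssembler

// VectorAssembler accepts numeric, boolean, and vector columns only;
// a StringType column in inputCols causes transform to throw.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income")) // hypothetical numeric columns
  .setOutputCol("features")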
How should I proceed?
Recommended Answer
I just wanted to complete Holden's answer.
Since Spark 2.3.0, OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead.
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

// import spark.implicits._ // needed for toDF in a standalone application

val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3))
  .toDF("id", "category1", "category2")

val indexer = new StringIndexer()
  .setInputCol("category1")
  .setOutputCol("category1Index")

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol, "category2"))
  .setOutputCols(Array("category1Vec", "category2Vec"))

val pipeline = new Pipeline().setStages(Array(indexer, encoder))
pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
// | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

df = spark.createDataFrame(
    [(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)],
    ["id", "category1", "category2"])

indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])

pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
# | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
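To get from these encoded columns to the single featuresCol that the classifiers expect, a VectorAssembler can be appended as one more pipeline stage. A sketch building on the Scala example above (it reuses df, indexer, and encoder from there):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the one-hot vectors (plus any other numeric columns) into one vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("category1Vec", "category2Vec"))
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler))

pipeline.fit(df).transform(df).select("id", "features").show(false)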
Since Spark 1.4.0, MLlib also supplies the OneHotEncoder feature, which maps a column of label indices to a column of binary vectors, with at most a single one-value.
This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
Let's consider the following DataFrame:

val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
  .toDF("id", "category")
The first step is to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| a| 0.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder:
import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show
// +---+-------------+
// | id| categoryVec|
// +---+-------------+
// | 0|(2,[0],[1.0])|
// | 1| (2,[],[])|
// | 2|(2,[1],[1.0])|
// | 3|(2,[0],[1.0])|
// | 4|(2,[0],[1.0])|
// | 5|(2,[1],[1.0])|
// +---+-------------+
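From there, the encoded vector can be assembled into a features column and fed to a classifier, which ties back to the featuresCol/labelCol question at the top. A minimal sketch, assuming a numeric label column named "label" (not part of the example data above):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Combine the encoded category with any other numeric features.
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryVec"))
  .setOutputCol("features")
val assembled = assembler.transform(encoded)

// "label" is hypothetical here; point labelCol at your own label column.
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
// val model = lr.fit(assembled) // requires the label column to exist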