Problem description
I am trying to do something like StringIndexer on a column of sentences, i.e. transforming a list of words into a list of integers.
For example:
Input dataset:
(1, ["I", "like", "Spark"])
(2, ["I", "hate", "Spark"])
I expected the output after StringIndexer to be like:
(1, [0, 2, 1])
(2, [0, 3, 1])
Ideally, I would like to make such a transformation part of a Pipeline, so that I can chain a couple of transformers together and serialize them for online serving.
Is this something Spark supports natively?
Thanks!
Recommended answer
The standard Transformers used for converting text to features are CountVectorizer:
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts.
or HashingTF:
Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
Both have a binary option which can be used to switch from counts to binary vectors.
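For example, a minimal sketch of that binary switch (the column names here are arbitrary):

import org.apache.spark.ml.feature.{CountVectorizer, HashingTF}

// CountVectorizer emitting 0/1 presence values instead of term counts.
val binaryCv = new CountVectorizer()
  .setInputCol("words").setOutputCol("features")
  .setBinary(true)

// HashingTF, also binary; numFeatures is a power of two, as the docs advise.
val binaryTf = new HashingTF()
  .setInputCol("words").setOutputCol("features")
  .setNumFeatures(1 << 10)
  .setBinary(true)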
There is no builtin Transformer that can give the exact result you want (and it wouldn't be useful for ML algorithms), but you can explode, apply StringIndexer, and collect_list / collect_set:
import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline

val df = Seq(
  (1, Array("I", "like", "Spark")), (2, Array("I", "hate", "Spark"))
).toDF("id", "words")

val pipeline = new Pipeline().setStages(Array(
  // Flatten each sentence into one row per word.
  new SQLTransformer()
    .setStatement("SELECT id, explode(words) AS word FROM __THIS__"),
  // Map each distinct word to a numeric index.
  new StringIndexer().setInputCol("word").setOutputCol("index"),
  // Collect the indices back into one row per sentence.
  new SQLTransformer()
    .setStatement("""SELECT id, COLLECT_SET(index) AS values
      FROM __THIS__ GROUP BY id""")
))

pipeline.fit(df).transform(df).show
// +---+---------------+
// | id| values|
// +---+---------------+
// | 1|[0.0, 1.0, 3.0]|
// | 2|[2.0, 0.0, 1.0]|
// +---+---------------+
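Note that COLLECT_SET drops duplicate words and, like any aggregation after explode, gives no guarantee about the order of the collected indices, so the result above need not match the original word order. If order and duplicates matter, one possible sketch (assuming Spark 2.4+ for array_sort and transform) carries each word's position through with posexplode:

val orderedPipeline = new Pipeline().setStages(Array(
  // posexplode keeps each word's position inside its sentence.
  new SQLTransformer()
    .setStatement("SELECT id, posexplode(words) AS (pos, word) FROM __THIS__"),
  new StringIndexer().setInputCol("word").setOutputCol("index"),
  // Sort by position, then drop it and keep only the indices.
  new SQLTransformer()
    .setStatement("""SELECT id,
      transform(array_sort(collect_list(struct(pos, index))), x -> x.index) AS values
      FROM __THIS__ GROUP BY id""")
))

orderedPipeline.fit(df).transform(df).show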
With CountVectorizer and a udf:
import org.apache.spark.ml.linalg._

// Return the indices of the non-zero entries of a vector.
spark.udf.register("indices", (v: Vector) => v.toSparse.indices)

val pipeline = new Pipeline().setStages(Array(
  // Count word occurrences into one sparse vector per sentence.
  new CountVectorizer().setInputCol("words").setOutputCol("vector"),
  // Keep only the non-zero positions, i.e. the word indices.
  new SQLTransformer()
    .setStatement("SELECT *, indices(vector) FROM __THIS__")
))

pipeline.fit(df).transform(df).show
// +---+----------------+--------------------+-------------------+
// | id| words| vector|UDF:indices(vector)|
// +---+----------------+--------------------+-------------------+
// | 1|[I, like, Spark]|(4,[0,1,3],[1.0,1...| [0, 1, 3]|
// | 2|[I, hate, Spark]|(4,[0,1,2],[1.0,1...| [0, 1, 2]|
// +---+----------------+--------------------+-------------------+
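Since the question mentions online serving: the fitted pipeline can be saved and reloaded like any other PipelineModel. A minimal sketch follows; the path is hypothetical, and a SQLTransformer that calls a registered UDF (such as indices above) needs that UDF re-registered in the loading session:

import org.apache.spark.ml.PipelineModel

val model = pipeline.fit(df)
model.write.overwrite().save("/tmp/sentence-indexer")  // hypothetical path

// In the serving process, re-register "indices" first, then:
val reloaded = PipelineModel.load("/tmp/sentence-indexer")
reloaded.transform(df).show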