Problem description
I have a spark dataframe with a column of short sentences, and a column with a categorical variable. I'd like to perform tf-idf on the sentences, one-hot-encoding on the categorical variable, and then output it to a sparse matrix on my driver once it's much smaller in size (for a scikit-learn model).
What is the best way to get the data out of spark in sparse form? It seems like there is only a toArray() method on sparse vectors, which outputs numpy arrays. However, the docs do say that scipy sparse arrays can be used in place of spark sparse arrays.
Keep in mind also that the tf_idf values are in fact a column of sparse vectors. Ideally it would be nice to get all these features into one large sparse matrix.
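For context, a minimal sketch of how such a features column could be produced with pyspark.ml, assuming a source dataframe raw_df with columns sentence and category; the column names, the use of HashingTF (rather than CountVectorizer), and the Spark 3.x multi-column OneHotEncoder API are assumptions on my part:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer, OneHotEncoder, VectorAssembler

# raw_df with columns "sentence" and "category" is assumed; adjust to your schema
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="tf_idf")
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
# Spark >= 3.0 API; Spark 2.x uses OneHotEncoderEstimator / the single-column OneHotEncoder
encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_ohe"])
assembler = VectorAssembler(inputCols=["tf_idf", "category_ohe"], outputCol="features")

pipeline = Pipeline(stages=[tokenizer, tf, idf, indexer, encoder, assembler])
df = pipeline.fit(raw_df).transform(raw_df).select("features")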
Answer
One possible solution can be expressed as follows:
convert features to RDD and extract vectors:
from pyspark.ml.linalg import SparseVector
from operator import attrgetter

# example dataframe with a single column of sparse vectors
df = sc.parallelize([
    (SparseVector(3, [0, 2], [1.0, 3.0]), ),
    (SparseVector(3, [1], [4.0]), )
]).toDF(["features"])

# pull the SparseVector out of each Row
features = df.rdd.map(attrgetter("features"))
add row indices:
indexed_features = features.zipWithIndex()
flatten to RDD of tuples (i, j, value):
def explode(row):
    # row is a (SparseVector, row_index) pair produced by zipWithIndex
    vec, i = row
    for j, v in zip(vec.indices, vec.values):
        yield i, j, v

entries = indexed_features.flatMap(explode)
collect and reshape:
row_indices, col_indices, data = zip(*entries.collect())
compute shape:
shape = (
    df.count(),                                       # number of rows
    df.rdd.map(attrgetter("features")).first().size   # vector dimension
)
create sparse matrix:
from scipy.sparse import csr_matrix
mat = csr_matrix((data, (row_indices, col_indices)), shape=shape)
quick sanity check:
mat.todense()
expected result:
matrix([[ 1.,  0.,  3.],
        [ 0.,  4.,  0.]])
Another one:
convert each row of features to matrix:
import numpy as np

def as_matrix(vec):
    # build a 1 x vec.size CSR matrix directly from (data, indices, indptr)
    data, indices = vec.values, vec.indices
    shape = 1, vec.size
    return csr_matrix((data, indices, np.array([0, vec.values.size])), shape)

mats = features.map(as_matrix)
and reduce with vstack:
from scipy.sparse import vstack
mat = mats.reduce(lambda x, y: vstack([x, y]))
or collect and vstack:
mat = vstack(mats.collect())
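Either way, mat is an ordinary scipy CSR matrix, so it can be fed directly to scikit-learn on the driver. A minimal sketch, assuming a label array y collected separately with one entry per row (the labels and the choice of LogisticRegression are illustrative assumptions):

from sklearn.linear_model import LogisticRegression

# y is an assumed array of labels, one per row of mat
clf = LogisticRegression()
clf.fit(mat, y)  # scikit-learn estimators accept scipy sparse (CSR) input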