Create a very large sparse matrix csv from a list of condensed data

Question

I have a dictionary of the format:

{
  "sample1": set(["feature1", "feature2", "feature3"]),
  "sample2": set(["feature1", "feature4", "feature5"]),
}

where I have 20M samples and 150K unique features.

I want to convert this into a csv of the format:

sample,feature1,feature2,feature3,feature4,feature5
sample1,1,1,1,0,0
sample2,1,0,0,1,1
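
For a sense of scale, here is a rough back-of-the-envelope estimate of how large that dense CSV would be. The sample and feature counts come from the question; the average of three features per sample is only an assumption borrowed from the toy example, not a figure the asker gave.

```python
# Rough size estimate for the dense 0/1 CSV described above.
N_SAMPLES = 20_000_000        # 20M samples (from the question)
N_FEATURES = 150_000          # 150K unique features (from the question)
AVG_FEATURES_PER_SAMPLE = 3   # ASSUMPTION: taken from the toy example only

# Every cell costs two bytes: one digit plus one separator (comma or newline).
dense_bytes = N_SAMPLES * N_FEATURES * 2

# A sparse representation only needs to record the cells that are 1.
nonzero_cells = N_SAMPLES * AVG_FEATURES_PER_SAMPLE

print(f"dense cells on disk: {dense_bytes / 1e12:.1f} TB (sample names excluded)")
print(f"cells that are actually 1: {nonzero_cells:,}")
```

Under these assumptions almost all of the file is zeros, which is why a sparse representation is worth considering before writing anything to disk.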

What I have done so far:

ALL_FEATURES = list(set(features))
with open("features.csv", "w") as f:
    f.write("fvecmd5," + ",".join([str(x) for x in ALL_FEATURES]) + "\n")
    fvecs_lol = list(fvecs.items())
    fvecs_keys, fvecs_values = zip(*fvecs_lol)
    del fvecs_lol
    tmp = [["1" if feature in featurelist else "0" for feature in ALL_FEATURES] for featurelist in fvecs_values]
    for i, entry in enumerate(tmp):
        f.write(fvecs_keys[i] + "," + ",".join(entry) + "\n")

But this runs very slowly. Are there faster ways? Maybe by leveraging NumPy or Cython?

Recommended answer

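You can use sklearn.feature_extraction.text.CountVectorizer, which produces a sparse matrix, and then create a SparseDataFrame from it. The session below needs older library releases: pd.SparseSeries and pd.SparseDataFrame were removed in pandas 1.0, and CountVectorizer.get_feature_names was removed in scikit-learn 1.2 in favor of get_feature_names_out.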

In [49]: s = pd.SparseSeries(d).astype(str).str.replace(r"[{,'}]",'')

In [50]: s
Out[50]:
sample1    feature1 feature2 feature3
sample2    feature1 feature5 feature4
dtype: object

In [51]: from sklearn.feature_extraction.text import CountVectorizer

In [52]: cv = CountVectorizer()

In [53]: r = pd.SparseDataFrame(cv.fit_transform(s),
                                s.index, 
                                cv.get_feature_names(), 
                                default_fill_value=0)

In [54]: r
Out[54]:
         feature1  feature2  feature3  feature4  feature5
sample1         1         1         1         0         0
sample2         1         0         0         1         1
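
If a dense CSV on disk really is the goal, one alternative is to stream it row by row instead of building the whole table in memory first. The following is a minimal standard-library sketch, not the method from the answer above: `write_dense_csv` and its `id_column` parameter are hypothetical names introduced here, and it assumes sample and feature names contain no commas or newlines.

```python
import io


def write_dense_csv(fvecs, out, id_column="sample"):
    """Stream one 0/1 CSV row per sample to the text file object `out`."""
    # Collect the feature vocabulary once; sorting fixes the column order.
    all_features = set()
    for feats in fvecs.values():
        all_features.update(feats)
    all_features = sorted(all_features)

    # Map each feature to its column so a lookup is O(1) per feature.
    col = {feat: i for i, feat in enumerate(all_features)}
    out.write(id_column + "," + ",".join(all_features) + "\n")

    # Template row "0,0,...,0": the digit for column i sits at offset 2 * i.
    template = bytearray(b",".join([b"0"] * len(all_features)))
    for name, feats in fvecs.items():
        row = bytearray(template)          # cheap copy of the all-zero row
        for feat in feats:                 # touch only the cells that are 1
            row[2 * col[feat]] = ord("1")
        out.write(name + "," + row.decode("ascii") + "\n")


# Usage with the toy data from the question, written to an in-memory buffer.
fvecs = {
    "sample1": {"feature1", "feature2", "feature3"},
    "sample2": {"feature1", "feature4", "feature5"},
}
buf = io.StringIO()
write_dense_csv(fvecs, buf)
print(buf.getvalue())
```

Each row is produced by copying the all-zero template and flipping only the features the sample has, so memory use stays at a single row regardless of the number of samples. The output file itself is still dense, so its size on disk is unchanged.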
