Creating a very large sparse-matrix CSV from a list of condensed data
Problem Description
I have a dictionary of the format:
{
    "sample1": set(["feature1", "feature2", "feature3"]),
    "sample2": set(["feature1", "feature4", "feature5"]),
}
where I have 20M samples and 150K unique features.
I want to convert this into a CSV of the format:
sample,feature1,feature2,feature3,feature4,feature5
sample1,1,1,1,0,0
sample2,1,0,0,1,1
What I have done so far:
# features: an iterable of every feature name; fvecs: the sample -> feature-set dict
ALL_FEATURES = list(set(features))

with open("features.csv", "w") as f:
    # Header: sample-id column followed by one column per feature
    f.write("fvecmd5," + ",".join([str(x) for x in ALL_FEATURES]) + "\n")
    fvecs_lol = list(fvecs.items())
    fvecs_keys, fvecs_values = zip(*fvecs_lol)
    del fvecs_lol
    # One 0/1 indicator row per sample, in the column order of ALL_FEATURES
    tmp = [["1" if feature in featurelist else "0" for feature in ALL_FEATURES]
           for featurelist in fvecs_values]
    for i, entry in enumerate(tmp):
        f.write(fvecs_keys[i] + "," + ",".join(entry) + "\n")
But this is running very slowly. Are there faster ways? Maybe leveraging NumPy/Cython?
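Much of the cost above comes from tmp: it materializes all 20M rows, each a 150K-element list of strings, before a single data row is written. A minimal streaming variant, assuming the same fvecs dict and ALL_FEATURES list as above, builds and writes one row at a time:

with open("features.csv", "w") as f:
    f.write("fvecmd5," + ",".join(str(x) for x in ALL_FEATURES) + "\n")
    for sample, featureset in fvecs.items():
        # Only one row is ever held in memory at a time
        row = ",".join("1" if feature in featureset else "0" for feature in ALL_FEATURES)
        f.write(sample + "," + row + "\n")

This fixes the memory behavior but not the per-cell Python work; the recommended answer below moves that work into a sparse matrix instead.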
Recommended Answer
You can use sklearn.feature_extraction.text.CountVectorizer, which produces a sparse matrix, and then build a SparseDataFrame from it (here pandas is imported as pd and d is the dictionary from the question):
In [49]: s = pd.SparseSeries(d).astype(str).str.replace(r"[{,'}]",'')
In [50]: s
Out[50]:
sample1 feature1 feature2 feature3
sample2 feature1 feature5 feature4
dtype: object
In [51]: from sklearn.feature_extraction.text import CountVectorizer
In [52]: cv = CountVectorizer()
In [53]: r = pd.SparseDataFrame(cv.fit_transform(s),
    ...:                        s.index,
    ...:                        cv.get_feature_names(),
    ...:                        default_fill_value=0)
In [54]: r
Out[54]:
feature1 feature2 feature3 feature4 feature5
sample1 1 1 1 0 0
sample2 1 0 0 1 1
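Note that pd.SparseSeries and pd.SparseDataFrame were removed in pandas 1.0, and at 20M x 150K the intermediate DataFrame is unnecessary if the goal is just the CSV. Below is a sketch of the same CountVectorizer idea on current scikit-learn, assuming the question's dict is bound to fvecs and that feature names contain no whitespace; it keeps the scipy.sparse matrix and streams the CSV out in row blocks:

from sklearn.feature_extraction.text import CountVectorizer

# One whitespace-joined token string per sample; each token is a feature name
samples = list(fvecs)
docs = [" ".join(fvecs[s]) for s in samples]

cv = CountVectorizer(analyzer=str.split)  # tokenize on whitespace only
X = cv.fit_transform(docs)                # scipy.sparse CSR, shape (n_samples, n_features)
features = cv.get_feature_names_out()     # requires sklearn >= 1.0

with open("features.csv", "w") as f:
    f.write("fvecmd5," + ",".join(features) + "\n")
    block = 10_000  # densify only a small block of rows at a time
    for start in range(0, X.shape[0], block):
        dense = X[start:start + block].toarray()
        for name, row in zip(samples[start:start + block], dense):
            f.write(name + "," + ",".join(map(str, row)) + "\n")

The block size trades memory for write throughput; the full 20M x 150K matrix is never densified.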