Question
I have a file, sparse.txt, that looks like this:
# first column is label 0 or 1
# rest of the data is sparse data
# the largest column index in the data is 4, so each row of the
# future dense matrix will have 1 + 4 = 5 elements
# file: sparse.txt
1 1:1 2:1 3:1
0 1:1 4:1
1 2:1 3:1 4:1
The required dense.txt looks like this:
# required file: dense.txt
1 1 1 1 0
0 1 0 0 1
1 0 1 1 1
Without using scipy's coo_matrix, I did it in a simple way like this:
def create_dense(fsparse, fdense, fvocab):
    # number of lines in the vocab file
    with open(fvocab) as fv:
        lvocab = sum(1 for line in fv)

    # create the dense file
    with open(fsparse) as fi, open(fdense, 'w') as fo:
        for i, line in enumerate(fi):
            # '1 1:1 2:1' -> ['1', '1', '1', '2', '1']
            words = line.strip('\n').split(':')
            words = " ".join(words).split()

            label = int(words[0])
            # tokens at odd positions are the column indices
            indices = [int(w) for (k, w) in enumerate(words) if k % 2]

            row = [0] * (lvocab + 1)
            row[0] = label

            # use listcomps
            row = [1 if j in indices else row[j] for j in range(len(row))]

            l = " ".join(map(str, row)) + "\n"
            fo.write(l)

            print('Writing dense matrix line: ', i + 1)
Question: How can we get the labels and data directly from the sparse data, without first creating a dense matrix, preferably using NumPy/SciPy?
Question: How can we read the sparse data using numpy.fromregex?
My attempt is:
def read_file(fsparse):
    regex = r'([0-1]\s)([0-9]):(1\s)*([0-9]:1)' + r'\s*\n'
    data = np.fromregex(fsparse,regex,dtype=str)

    print(data,file=open('dense.txt','w'))
It didn't work!
Recommended answer
Tweaking your code to create the dense array directly, rather than via a file:
import numpy as np
from scipy import sparse

fsparse = 'stack47266965.txt'

def create_dense(fsparse, lvocab):
    alist = []
    with open(fsparse) as fi:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()

            label = int(words[0])
            # tokens at odd positions are the column indices
            indices = [int(w) for (k, w) in enumerate(words) if k % 2]

            row = [0] * (lvocab + 1)
            row[0] = label

            # use listcomps
            row = [1 if j in indices else row[j] for j in range(len(row))]
            alist.append(row)
    return alist

alist = create_dense(fsparse, 4)
print(alist)

arr = np.array(alist)
M = sparse.coo_matrix(arr)
print(M)
print(M.toarray())   # M.A in the original; toarray() works across SciPy versions
produces
0926:~/mypy$ python3 stack47266965.py
[[1, 1, 1, 1, 0], [0, 1, 0, 0, 1], [1, 0, 1, 1, 1]]
(0, 0) 1
(0, 1) 1
(0, 2) 1
(0, 3) 1
(1, 1) 1
(1, 4) 1
(2, 0) 1
(2, 2) 1
(2, 3) 1
(2, 4) 1
[[1 1 1 1 0]
 [0 1 0 0 1]
 [1 0 1 1 1]]
If you want to skip the dense arr, you need to generate the equivalent of the M.row, M.col, and M.data attributes (order doesn't matter):
[0 0 0 0 1 1 2 2 2 2]
[0 1 2 3 1 4 0 2 3 4]
[1 1 1 1 1 1 1 1 1 1]
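As a quick check of that equivalence, the three arrays above can be passed straight to coo_matrix. This is only a minimal sketch: the names row, col and data and the explicit shape=(3, 5) are assumptions made for illustration, not part of the original code.

```python
import numpy as np
from scipy import sparse

# Coordinates and values copied from the M.row, M.col and M.data listing above.
row = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
col = np.array([0, 1, 2, 3, 1, 4, 0, 2, 3, 4])
data = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

# Assumption: 3 rows, and 1 label column + 4 feature columns.
# Without shape, coo_matrix infers it from the largest indices.
M = sparse.coo_matrix((data, (row, col)), shape=(3, 5))
dense = M.toarray()
print(dense)
```

Passing shape explicitly is a defensive choice: if no line happens to use the last column, the inferred matrix would otherwise come out narrower than expected.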
I don't use regex much, so I won't try to fix that. I assume you want to convert
'1 1:1 2:1 3:1'
into
['1' '1' '1' '2' '1' '3' '1']
But that just gets you to the words/label stage.
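For what it's worth, the words/label stage can be reached with the standard re module instead of numpy.fromregex. This is a swapped-in technique, not a fix for the regex in the question: as far as I can tell, fromregex maps a fixed set of regex groups to fields, which fits poorly with lines that carry a varying number of index:value pairs. The sketch assumes every token is a non-negative integer, and the sample line is taken from sparse.txt.

```python
import re

line = '1 1:1 2:1 3:1'

# Every run of digits is one token: the label first, then index/value pairs.
words = re.findall(r'\d+', line)

label = int(words[0])
# Tokens at odd positions are the column indices; each is followed by its value.
indices = [int(w) for w in words[1::2]]

print(words, label, indices)
```

From here the label and indices feed into the same row/col/data bookkeeping as the split-based version.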
Directly to sparse:
def create_sparse(fsparse, lvocab):
    row, col, data = [], [], []
    with open(fsparse) as fi:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()

            label = int(words[0])
            # the label goes in column 0 of row i
            row.append(i); col.append(0); data.append(label)

            indices = [int(w) for (k, w) in enumerate(words) if k % 2]
            for j in indices:   # quick-n-dirty version
                row.append(i); col.append(j); data.append(1)
    return row, col, data

r, c, d = create_sparse(fsparse, 4)
print(r, c, d)
M = sparse.coo_matrix((d, (r, c)))
print(M)
print(M.toarray())   # M.A in the original; toarray() works across SciPy versions
produces
[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2] [0, 1, 2, 3, 0, 1, 4, 0, 2, 3, 4] [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
....
The only thing that's different is the one data item with value 0 (the label of the second row), which is stored explicitly. sparse will take care of that: the entry still comes out as 0 in the dense result.
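If the explicitly stored 0 matters, for instance when counting stored entries, it can be dropped after conversion to CSR. The following is a sketch under assumptions: r, c and d are copied from the printed output above, and stored_before, stored_after and M_csr are names introduced here for illustration.

```python
from scipy import sparse

# Output of create_sparse; the explicit 0 is the label of the second row.
r = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
c = [0, 1, 2, 3, 0, 1, 4, 0, 2, 3, 4]
d = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

M = sparse.coo_matrix((d, (r, c)))
stored_before = M.nnz       # counts stored entries, including the explicit 0

M_csr = M.tocsr()
M_csr.eliminate_zeros()     # removes explicitly stored zeros in place
stored_after = M_csr.nnz

dense = M_csr.toarray()
print(stored_before, stored_after)
print(dense)
```

The dense result is the same either way; only the number of stored entries changes.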