Problem description
I am working with data and would like to create a sparse matrix from it, to be used later for clustering.
fileHandle = open('data', 'r')
for line in fileHandle:
    json_list = []
    fields = line.split('\t')
    json_list.append(fields[0])
    json_list.append(fields[1])
    json_list.append(fields[3])
Right now the data looks like this:
term, ids, quantity
['buick', '123,234', '500']
['chevy', '345,456', '300']
['suv','123', '100']
The output I need looks like this:
term, quantity, '123', '234', '345', '456', '567'
buick, 500, 1, 1, 0, 0, 0
chevy, 300, 0, 0, 1, 1, 0
suv, 100, 1, 0, 0, 0, 0
I've tried working with the numpy sparse matrix library, but with no success.
Recommended answer
scikit-learn probably has the tools to do this easily, but I'll demonstrate a basic Python/numpy solution.
The raw data, a list of lists:
In [1150]: data=[['buick', '123,234', '500'],
                 ['chevy', '345,456', '300'],
                 ['suv','123', '100']]
I can pull out the various columns with list comprehensions. This might not be the fastest approach for a very large dataset, but for now it's an easy way to tackle the problem piece by piece.
In [1151]: terms=[row[0] for row in data]
In [1152]: terms
Out[1152]: ['buick', 'chevy', 'suv']
In [1153]: quantities=[int(row[2]) for row in data]
In [1154]: quantities
Out[1154]: [500, 300, 100]
Create the list of possible ids. I could pull these from data, but you are apparently using a larger list. They could be strings instead of ints.
In [1155]: idset=[123,234,345,456,567]
In [1156]: ids=[[int(i) for i in row[1].split(',')] for row in data]
In [1157]: ids
Out[1157]: [[123, 234], [345, 456], [123]]
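If the master list is not known in advance, one option is to derive it from the rows themselves. The following is only a minimal sketch, assuming the ids are integers and that only ids actually present in data matter (so 567, which appears in no row, would not get a column). The name idset_from_data is made up for this example.

```python
data = [['buick', '123,234', '500'],
        ['chevy', '345,456', '300'],
        ['suv', '123', '100']]

# Split each comma-separated id field into a list of ints.
ids = [[int(i) for i in row[1].split(',')] for row in data]

# Union of all ids seen in any row, sorted so the column order is reproducible.
idset_from_data = sorted({i for sub in ids for i in sub})
print(idset_from_data)
```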
np.in1d is a handy tool for finding where those sublists fit in the master list. The resulting idM is the feature matrix, with lots of 0s and a few 1s.
In [1158]: idM=np.array([np.in1d(idset,i) for i in ids],int)
In [1159]: idM
Out[1159]:
array([[1, 1, 0, 0, 0],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 0]])
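On recent NumPy releases np.in1d is deprecated in favour of np.isin, so the same step might be written as below. This is a sketch rather than the answer's own code; the helper name indicator_matrix is an assumption introduced for illustration.

```python
import numpy as np

idset = [123, 234, 345, 456, 567]
ids = [[123, 234], [345, 456], [123]]

def indicator_matrix(id_lists, master):
    """Return a 0/1 array: one row per id list, one column per master id."""
    master = np.asarray(master)
    # np.isin(master, sub) is True at the positions of master that occur in sub.
    return np.array([np.isin(master, sub) for sub in id_lists], dtype=int)

idM = indicator_matrix(ids, idset)
print(idM)
```

Each row is built independently, so the loop could later be swapped for something faster without changing the result.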
We could assemble the pieces in various ways.
For example, a structured array could be created with:
In [1161]: M=np.zeros(len(data),dtype='U10,int,(5)int')
In [1162]: M['f0']=terms
In [1163]: M['f1']=quantities
In [1164]: M['f2']=idM
In [1165]: M
Out[1165]:
array([('buick', 500, [1, 1, 0, 0, 0]), ('chevy', 300, [0, 0, 1, 1, 0]),
       ('suv', 100, [1, 0, 0, 0, 0])],
      dtype=[('f0', '<U10'), ('f1', '<i4'), ('f2', '<i4', (5,))])
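Another way to assemble the pieces is to write them out as text in the layout the question asks for. A rough sketch using plain Python lists, with the values copied from the session above; the variable names header, rows and table are illustrative only.

```python
terms = ['buick', 'chevy', 'suv']
quantities = [500, 300, 100]
idset = [123, 234, 345, 456, 567]
idM = [[1, 1, 0, 0, 0], [0, 0, 1, 1, 0], [1, 0, 0, 0, 0]]

# Header: the two fixed columns followed by one quoted column per id.
header = ', '.join(['term', 'quantity'] + [repr(str(i)) for i in idset])

# One text row per term: name, quantity, then the 0/1 flags.
rows = [', '.join([t, str(q)] + [str(v) for v in flags])
        for t, q, flags in zip(terms, quantities, idM)]

table = '\n'.join([header] + rows)
print(table)
```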
idM could be turned into a sparse matrix with:
In [1167]: from scipy import sparse
In [1168]: c=sparse.coo_matrix(idM)
In [1169]: c
Out[1169]:
<3x5 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in COOrdinate format>
In [1170]: c.A
Out[1170]:
array([[1, 1, 0, 0, 0],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 0]])
In this exploration it was easier to create the dense array first and then make a sparse matrix from it.
But sparse provides a bmat function that lets me create the multirow matrix from a list of single-row ones. (See my edit history for a version that constructs the coo inputs directly.)
In [1220]: ll=[[sparse.coo_matrix(np.in1d(idset,i),dtype=int)] for i in ids]
In [1221]: sparse.bmat(ll)
Out[1221]:
<3x5 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in COOrdinate format>
In [1222]: sparse.bmat(ll).A
Out[1222]:
array([[1, 1, 0, 0, 0],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 0]], dtype=int32)
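For larger data it may be preferable to skip the dense intermediate altogether and hand coo_matrix its (values, (rows, cols)) inputs directly. The sketch below is one possible way to do that and is not necessarily the version from the edit history; the lookup dict col_of is an assumption introduced here, and an id missing from idset would raise a KeyError.

```python
import numpy as np
from scipy import sparse

idset = [123, 234, 345, 456, 567]
ids = [[123, 234], [345, 456], [123]]

# Map each id to its column index in the master list.
col_of = {v: j for j, v in enumerate(idset)}

rows, cols = [], []
for r, sub in enumerate(ids):
    for v in sub:
        rows.append(r)          # row index = position of the term
        cols.append(col_of[v])  # column index = position of the id in idset

# Every stored entry is a 1; the shape is given explicitly so that
# trailing all-zero columns (such as 567) are kept.
vals = np.ones(len(rows), dtype=int)
c = sparse.coo_matrix((vals, (rows, cols)), shape=(len(ids), len(idset)))
dense = c.toarray()
print(dense)
```

Only the five nonzero entries are ever stored, so memory use scales with the number of (term, id) pairs rather than with the full matrix size.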