python - Data structure of csr_matrix in Python

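I am studying TF-IDF. I used TfidfVectorizer.fit_transform. It returns a csr_matrix, but I can't understand the structure of the result.
Input data:
documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun, the bright sun")
Code: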

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)

The result is:
  (0, 9)    0.34399327143
  (0, 7)    0.519713848879
  (0, 4)    0.420753151645
  (0, 0)    0.659191117868
  (1, 9)    0.426858009784
  (1, 4)    0.522108621994
  (1, 8)    0.522108621994
  (1, 1)    0.522108621994
  (2, 9)    0.526261040111
  (2, 7)    0.397544332095
  (2, 4)    0.32184639876
  (2, 8)    0.32184639876
  (2, 1)    0.32184639876
  (2, 3)    0.504234576856
  (3, 9)    0.390963088213
  (3, 8)    0.47820398015
  (3, 1)    0.239101990075
  (3, 10)   0.374599471224
  (3, 2)    0.374599471224
  (3, 5)    0.374599471224
  (3, 6)    0.374599471224
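tfidf_matrix is a csr_matrix. So I looked this up, but the documentation does not show the same structure as my result: scipy.sparse.csr_matrix
What is the structure of a value like (0, 9) 0.34399327143?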

Best answer

Without the vectorizer, I can more or less recreate your matrix with this sequence of operations:

In [703]: documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun" )

Get a list of words for each sentence (all lower case):

In [704]: alist = [l.lower().split() for l in documents]

Get a sorted list of the unique words:

In [705]: aset = set()
In [706]: [aset.update(l) for l in alist]
Out[706]: [None, None, None, None]
In [707]: unq = sorted(list(aset))
In [708]: unq
Out[708]:
['blue',
 'bright',
 'can',
 'in',
 'is',
 'see',
 'shining',
 'sky',
 'sun',
 'the',
 'we']

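Go through alist and collect the word counts. rows will hold the sentence number, cols the index of the word in the unique word list, and data a 1 for every occurrence: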

In [709]: rows, cols, data = [],[],[]
In [710]: for i,row in enumerate(alist):
     ...:     for c in row:
     ...:         rows.append(i)
     ...:         cols.append(unq.index(c))
     ...:         data.append(1)
     ...:
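The loop above calls unq.index(c) for every word, which rescans the list each time. For a larger vocabulary, a dictionary from word to column index does the same job with a single lookup. The following is only a sketch of that variation; the name col_of is my own and does not appear in the session above.

```python
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun the bright sun",
)

# Lower-case and split each sentence, as in the session above.
alist = [doc.lower().split() for doc in documents]

# Sorted unique words; a word's position in this list is its column index.
unq = sorted({word for row in alist for word in row})

# Hypothetical helper mapping: word -> column index (replaces unq.index).
col_of = {word: j for j, word in enumerate(unq)}

rows, cols, data = [], [], []
for i, row in enumerate(alist):
    for word in row:
        rows.append(i)             # sentence number
        cols.append(col_of[word])  # column of this word
        data.append(1)             # one entry per occurrence

print(list(zip(rows, cols))[:4])
```

The three lists have one entry per word occurrence, so a word repeated within a sentence shows up as a repeated (row, col) pair.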

Build a sparse matrix from this data (sparse here is scipy.sparse, imported with from scipy import sparse):

In [711]: M = sparse.csr_matrix((data,(rows,cols)))
In [712]: M
Out[712]:
<4x11 sparse matrix of type '<class 'numpy.int32'>'
    with 21 stored elements in Compressed Sparse Row format>
In [713]: print(M)
  (0, 0)    1
  (0, 4)    1
  (0, 7)    1
  (0, 9)    1
  (1, 1)    1
  ....
  (3, 9)    2
  (3, 10)   1
In [714]: M.A        # viewed as 2d array
Out[714]:
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
       [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)
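One detail that is easy to miss: the loop appends a 1 for every word occurrence, yet entries such as (3, 9) hold 2. When csr_matrix is built from (data, (rows, cols)), repeated (row, col) pairs are added together, which is what turns the list of ones into counts. Here is a self-contained sketch of the whole sequence, with the imports the session leaves out; the explicit shape argument is my addition so that the size does not depend on the largest index seen.

```python
from scipy import sparse

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun the bright sun",
)

tokens = [doc.lower().split() for doc in documents]
vocab = sorted({word for row in tokens for word in row})

rows, cols, data = [], [], []
for i, row in enumerate(tokens):
    for word in row:
        rows.append(i)
        cols.append(vocab.index(word))
        data.append(1)

# Duplicate (row, col) pairs are summed during construction,
# so a word that occurs twice in a sentence ends up with the value 2.
M = sparse.csr_matrix((data, (rows, cols)), shape=(len(documents), len(vocab)))

dense = M.toarray()
print(len(data))  # number of word occurrences fed in
print(M.nnz)      # number of stored elements after duplicates are merged
print(dense)
```

In the session above the matrix reports 21 stored elements, fewer than the number of words appended, for the same reason.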

Since you are using sklearn, I can reproduce your matrix with:

In [717]: from sklearn import feature_extraction
In [718]: tf = feature_extraction.text.TfidfVectorizer()
In [719]: tfM = tf.fit_transform(documents)
In [720]: tfM
Out[720]:
<4x11 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>
In [721]: print(tfM)
  (0, 9)    0.34399327143
  (0, 7)    0.519713848879
  (0, 4)    0.420753151645
  ....
  (3, 5)    0.374599471224
  (3, 6)    0.374599471224
In [722]: tfM.A
Out[722]:
array([[ 0.65919112,  0.        ,  0.        ,  0.        ,  0.42075315,
         0.        ,  0.        ,  0.51971385,  0.        ,  0.34399327,
         0.        ],....
       [ 0.        ,  0.23910199,  0.37459947,  0.        ,  0.        ,
         0.37459947,  0.37459947,  0.        ,  0.47820398,  0.39096309,
         0.37459947]])
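To see where a value such as 0.34399327 comes from, the weights can be recomputed by hand. The sketch below assumes what I understand to be TfidfVectorizer's default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. It also assumes plain whitespace splitting gives the same tokens as the vectorizer for these sentences. The function name tfidf_row is mine, not part of sklearn.

```python
import math
from collections import Counter

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun the bright sun",
)

tokens = [doc.lower().split() for doc in documents]
n_docs = len(tokens)

# Document frequency: in how many sentences each word occurs.
df = Counter(word for row in tokens for word in set(row))

def tfidf_row(row):
    """Return {word: weight} for one sentence (assumed default settings)."""
    counts = Counter(row)
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
    return {word: v / norm for word, v in raw.items()}

row0 = tfidf_row(tokens[0])
for word in sorted(row0):
    print(word, round(row0[word], 8))
```

For the first sentence this gives weights for blue, is, sky and the that agree with the first row of tfM.A to the digits shown.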

The actual data is stored in 3 attribute arrays:

In [723]: tfM.indices
Out[723]:
array([ 9,  7,  4,  0,  9,  4,  8,  1,  9,  7,  4,  8,  1,  3,  9,  8,  1,
       10,  2,  5,  6], dtype=int32)
In [724]: tfM.data
Out[724]:
array([ 0.34399327,  0.51971385,  0.42075315,  0.65919112,  0.42685801,
       ...
        0.37459947])
In [725]: tfM.indptr
Out[725]: array([ 0,  4,  8, 14, 21], dtype=int32)
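These three arrays are all that is needed to rebuild the (row, col) value lines that print shows. Row i owns the slice indptr[i]:indptr[i+1] of both indices (the column numbers) and data (the values). A minimal pure-Python sketch of that decoding follows; iter_csr is a name of my own, and because the data array is abbreviated above, only its first four values (row 0) are used.

```python
def iter_csr(data, indices, indptr):
    """Yield (row, col, value) for every stored element of a CSR layout."""
    for row in range(len(indptr) - 1):
        start, stop = indptr[row], indptr[row + 1]
        for k in range(start, stop):
            yield row, indices[k], data[k]

# Values copied from the output above.
indices = [9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 10, 2, 5, 6]
indptr = [0, 4, 8, 14, 21]
data_row0 = [0.34399327, 0.51971385, 0.42075315, 0.65919112]

# Number of stored elements in each row is the difference of neighbours.
row_lengths = [indptr[i + 1] - indptr[i] for i in range(len(indptr) - 1)]
print(row_lengths)

# Decode row 0 only: restrict indptr to its first two entries.
triples = list(iter_csr(data_row0, indices, indptr[:2]))
for row, col, value in triples:
    print((row, col), value)
```

So a printed line like (0, 9) 0.34399327143 means: row 0 (the first sentence), column 9 (the tenth word of the vocabulary), value 0.34399327143.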

The indices values for an individual row tell us which words occur in that sentence (np is numpy):

In [726]: np.array(unq)[M[0,].indices]
Out[726]:
array(['blue', 'is', 'sky', 'the'],
      dtype='<U7')
In [727]: np.array(unq)[M[3,].indices]
Out[727]:
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'],
      dtype='<U7')
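The same lookup works on the tfM arrays shown earlier without any slicing of the matrix itself. The sketch below uses plain lists and assumes, as the comparison above does, that the vectorizer's column order matches the sorted word list unq. The helper words_in_row is hypothetical.

```python
unq = ['blue', 'bright', 'can', 'in', 'is', 'see',
       'shining', 'sky', 'sun', 'the', 'we']

# tfM.indices and tfM.indptr as printed above.
indices = [9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 10, 2, 5, 6]
indptr = [0, 4, 8, 14, 21]

def words_in_row(row):
    """Map the column indices stored for one row back to words."""
    start, stop = indptr[row], indptr[row + 1]
    return [unq[j] for j in indices[start:stop]]

for row in range(len(indptr) - 1):
    print(row, words_in_row(row))
```

The words come back in the order the columns are stored for that row, which for tfM is not sorted, unlike the rows of M.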

Regarding python - data structure of csr_matrix in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45678491/