Question
I have a 1.2GB list of edges from a graph in a text file. My Ubuntu PC has 8GB of RAM. Each line in the input looks like
287111206 357850135
I would like to convert it into a sparse adjacency matrix and output that to a file.
Some statistics for my data:
Number of edges: around 62500000
Number of vertices: around 31250000
I asked much the same question before at https://stackoverflow.com/a/38667644/2179021 and got a great answer. The problem is that I can't get it to work.
I first tried np.loadtxt to load in the file, but it was very slow and used a huge amount of memory. So instead I moved to pandas.read_csv, which is very fast but caused its own problems. This is my current code:
import pandas
import numpy as np
from scipy import sparse
data = pandas.read_csv("edges.txt", sep=" ", header= None, dtype=np.uint32)
A = data.as_matrix()
print type(A)
k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
rows,cols=k3.reshape(A.shape).T
M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
print type(M)
The problem is that the pandas dataframe data is huge and I am effectively making a copy of it in A, which is inefficient. However, things are even worse as the code crashes with:
<type 'instancemethod'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 13, in <module>
    rows,cols=k3.reshape(A.shape).T
AttributeError: 'function' object has no attribute 'shape'

raph@raph-desktop:~/python$ python make-sparse-matrix.py
<type 'numpy.ndarray'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 12, in <module>
    k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py", line 209, in unique
    iflag = np.cumsum(flag) - 1
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2115, in cumsum
    return cumsum(axis, dtype, out)
MemoryError
So my questions are:
- Can I avoid having both the 1.2GB pandas dataframe and the 1.2GB numpy array in memory at the same time?
- Is there some way to get the code to complete in 8GB of RAM?
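For scale, a rough back-of-the-envelope estimate (assuming uint32 vertex IDs; the figures are approximate) of what the raw edge array alone should need:

edges = 62500000
bytes_per_edge = 2 * 4  # two uint32 vertex IDs at 4 bytes each
print(edges * bytes_per_edge / 1e9)  # ~0.5, i.e. about 0.5 GB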
You can reproduce a test input of the size I am trying to process with:
import random

# Number of edges and vertices
m = 62500000
n = m / 2
for i in xrange(m):
    fromnode = str(random.randint(0, n-1)).zfill(9)
    tonode = str(random.randint(0, n-1)).zfill(9)
    print fromnode, tonode
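Redirecting the script's output to a file (for example, python make-test-edges.py > edges.txt, where the script name is arbitrary) produces a test file of roughly the intended 1.2GB size.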
Update
I have now tried a number of different approaches, all of which have failed. Here is a summary.
- Using igraph with g = Graph.Read_Ncol('edges.txt'). This uses a huge amount of RAM which crashes my computer.
- Using networkit with G = networkit.graphio.readGraph("edges.txt", networkit.Format.EdgeList, separator=" ", continuous=False). This uses a huge amount of RAM which crashes my computer.
- The code above in this question, but using np.loadtxt("edges.txt") instead of pandas. This uses a huge amount of RAM which crashes my computer.
I then wrote separate code which remapped all the vertex names to numbers from 1..|V|, where |V| is the total number of vertices. This should save the code that imports the edge list from having to build up a table that maps the vertex names (a minimal sketch of this remapping step is shown after the list below). Using this I tried:
- Using this new remapped edge list file, I used igraph again with g = Graph.Read_Edgelist("edges-contig.txt"). This now works, although it takes 4GB of RAM (which is way more than the theoretical amount it should). However, there is no igraph function to write out a sparse adjacency matrix from a graph. The recommended solution is to convert the graph to a coo_matrix. Unfortunately this uses a huge amount of RAM, which crashes my computer.
- Using the remapped edge list file, I used networkit with G = networkit.readGraph("edges-contig.txt", networkit.Format.EdgeListSpaceOne). This also works, using less than the 4GB that igraph needs. networkit also comes with a function to write Matlab files (which are a form of sparse adjacency matrix that scipy can read). However, networkit.graphio.writeMat(G,"test.mat") uses a huge amount of RAM, which crashes my computer.
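A minimal sketch of the remapping step mentioned above (the exact code is not shown here; the file names and the dict-based approach are illustrative assumptions, and the dict itself still needs several GB for ~31 million vertex names):

id_of = {}

def vertex_id(name):
    # Assign contiguous integer IDs 0, 1, 2, ... the first time a vertex name is seen.
    if name not in id_of:
        id_of[name] = len(id_of)
    return id_of[name]

with open("edges.txt") as fin, open("edges-contig.txt", "w") as fout:
    for line in fin:
        a, b = line.split()
        fout.write("%d %d\n" % (vertex_id(a), vertex_id(b)))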
Finally, sascha's answer below does complete, but it takes about 40 minutes.
Recommended Answer
Here is my solution:
import numpy as np
import pandas as pd
import scipy.sparse as ss

def read_data_file_as_coo_matrix(filename='edges.txt'):
    "Read data file and return sparse matrix in coordinate format."
    data = pd.read_csv(filename, sep=' ', header=None, dtype=np.uint32)
    rows = data[0]  # Not a copy, just a reference.
    cols = data[1]
    ones = np.ones(len(rows), np.uint32)
    matrix = ss.coo_matrix((ones, (rows, cols)))
    return matrix
Pandas does the heavy lifting of parsing using read_csv, and Pandas is already storing the data in columnar format. data[0] and data[1] just get references, no copies. Then I feed those to coo_matrix. Benchmarked locally:
In [1]: %timeit -n1 -r5 read_data_file_as_coo_matrix()
1 loop, best of 5: 14.2 s per loop
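A quick usage sketch (variable names are illustrative, not from the original answer):

matrix = read_data_file_as_coo_matrix('edges.txt')
print(matrix.shape)  # (max row ID + 1, max column ID + 1), as inferred by coo_matrix
print(matrix.nnz)    # number of stored entries, i.e. edges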
Then to save a csr-matrix to a file:
def save_csr_matrix(filename, matrix):
    """Save compressed sparse row (csr) matrix to file.
    Based on http://stackoverflow.com/a/8980156/232571
    """
    assert filename.endswith('.npz')
    attributes = {
        'data': matrix.data,
        'indices': matrix.indices,
        'indptr': matrix.indptr,
        'shape': matrix.shape,
    }
    np.savez(filename, **attributes)
Benchmarked locally:
In [3]: %timeit -n1 -r5 save_csr_matrix('edges.npz', matrix.tocsr())
1 loop, best of 5: 13.4 s per loop
And later load it back from a file:
def load_csr_matrix(filename):
    """Load compressed sparse row (csr) matrix from file.
    Based on http://stackoverflow.com/a/8980156/232571
    """
    assert filename.endswith('.npz')
    loader = np.load(filename)
    args = (loader['data'], loader['indices'], loader['indptr'])
    matrix = ss.csr_matrix(args, shape=loader['shape'])
    return matrix
Benchmarked locally:
In [4]: %timeit -n1 -r5 load_csr_matrix('edges.npz')
1 loop, best of 5: 881 ms per loop
Finally, test it:
def test():
    "Test data file parsing and matrix serialization."
    coo_matrix = read_data_file_as_coo_matrix()
    csr_matrix = coo_matrix.tocsr()
    save_csr_matrix('edges.npz', csr_matrix)
    loaded_csr_matrix = load_csr_matrix('edges.npz')
    # Comparison based on http://stackoverflow.com/a/30685839/232571
    assert (csr_matrix != loaded_csr_matrix).nnz == 0

if __name__ == '__main__':
    test()
When running test(), it takes about 30 seconds:
$ time python so_38688062.py
real 0m30.401s
user 0m27.257s
sys 0m2.759s
And the memory high-water mark was ~1.79 GB.
Note that once you've converted "edges.txt" to "edges.npz" in CSR-matrix format, loading it will take less than a second.
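For example, in a later session (a minimal sketch; the one-time conversion cost is not paid again):

matrix = load_csr_matrix('edges.npz')
print(matrix.nnz)  # the matrix is ready to use in well under a second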