Problem Description
Basically, what is the best way to go about storing and using dense matrices in python?
I have a project that generates a similarity metric between every pair of items in an array.
Each item is a custom class, and stores a pointer to the other class and a number representing its "closeness" to that class.
Right now, it works brilliantly up to ~8000 items, after which it fails with an out-of-memory error.
Basically, if you assume that each comparison uses ~30 bytes to store the similarity (a figure that seems accurate based on testing), the total required memory is: numItems^2 * itemSize = Memory
So the memory usage is quadratic in the number of items.
In my case, the memory size is ~30 bytes per link, so: 8000 * 8000 * 30 = 1,920,000,000 bytes, or 1.9 GB,
which is right at the memory limit for a single (32-bit) process.
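For reference, a quick sanity check of that arithmetic (the 30-bytes-per-entry figure is an empirical estimate from testing, not an exact number):

# Rough memory estimate for a full n x n similarity matrix.
# The ~30 bytes per entry is an empirical figure, not exact.
def full_matrix_bytes(num_items, bytes_per_entry=30):
    return num_items ** 2 * bytes_per_entry

print(full_matrix_bytes(8000))   # 1920000000 bytes, i.e. ~1.9 GB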
It seems to me that there has to be a more effective way of doing this. I've looked at memmapping, but generating the similarity values is already computationally intensive, and bottlenecking everything through a hard drive seems a little ridiculous.
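For completeness, a disk-backed array via numpy.memmap would look roughly like this (a minimal sketch; the filename, shape, and dtype are placeholders):

import numpy as np

# A disk-backed array: the data lives in 'similarity.dat' on disk, and the
# OS pages chunks in and out of RAM as they are touched.
sim = np.memmap('similarity.dat', dtype=np.float32,
                mode='w+', shape=(20000, 20000))

sim[0, 1] = 0.87   # writes go through to the file
sim.flush()        # ensure everything is persisted to disk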
Edit
I've looked at numpy and scipy. Unfortunately, they don't support very large arrays either.
>>> import numpy as np
>>> np.zeros((20000,20000), dtype=np.uint16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>>
Further Edit
Numpy seems to be popular. However, numpy won't really do what I want, at least not without another abstraction layer.
I don't want to store numbers; I want to store references to classes. Numpy supports objects, but that doesn't really address the array size issue. I brought up numpy just as an example of what isn't working.
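To illustrate the object-dtype point: numpy will happily hold references to class instances, but every cell still costs a pointer (plus the instance itself), so the n x n blow-up is unchanged. A minimal sketch, with a made-up Link class standing in for the custom class:

import numpy as np

class Link:  # hypothetical stand-in for the custom item class
    def __init__(self, target, closeness):
        self.target = target          # reference to the other item
        self.closeness = closeness    # similarity score

n = 100
links = np.empty((n, n), dtype=object)  # each cell holds a Python reference
links[0, 1] = Link(target=None, closeness=0.87)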
Any advice?
Edit: Well, I wound up just rewriting all the logic so it no longer stores any redundant values, reducing the memory usage from n^2 stored links to n*(n-1)/2.
Basically, this whole affair is a version of the handshake problem, so I've switched from storing every link twice to storing only a single copy of each link.
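A minimal sketch of that triangular storage (the helper name tri_index is my own): keep a flat list of n*(n-1)/2 slots and map each unordered pair (i, j) onto a single index, assuming row-major packing of the strict upper triangle:

def tri_index(i, j, n):
    # Map the unordered pair (i, j) into the flat upper-triangle array.
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)

n = 8000
links = [None] * (n * (n - 1) // 2)  # one slot per unordered pair

links[tri_index(2, 5, n)] = 0.87     # (2, 5) and (5, 2) share a slot
assert tri_index(2, 5, n) == tri_index(5, 2, n)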
It's not a complete solution, but I generally don't have any datasets large enough to overflow it, so I think it will work out. PyTables is really interesting, but I don't know any SQL, and there doesn't appear to be any nice traditional slicing or index-based way to access the table data. I may revisit the issue in the future.
PyTables can handle tables of arbitrary size (millions of columns!) by using memmapping and some clever compression.
Ostensibly, it provides SQL-like performance to Python. It will, however, require significant code modifications.
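For anyone vetting this later, a minimal PyTables sketch (assuming the modern tables API; the file and node names are placeholders). Its array nodes accept ordinary NumPy-style slicing, so no SQL is needed for this kind of access:

import numpy as np
import tables

# A compressed, chunked, on-disk 2-D array of similarity values.
h5 = tables.open_file('similarity.h5', mode='w')
sim = h5.create_carray(h5.root, 'sim',
                       atom=tables.Float32Atom(),
                       shape=(20000, 20000),
                       filters=tables.Filters(complevel=5, complib='zlib'))

sim[0, :] = np.random.rand(20000)  # write a whole row at once
row_chunk = sim[0, 100:200]        # ordinary slice-based reads
h5.close()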
I'm not going to accept this answer until I've done a more thorough vetting to ensure it can actually do what I want, or until someone provides a better solution.