Problem Description
I am processing some data and have stored the results in three dictionaries, which I saved to disk with Pickle. Each dictionary is 500-1000MB.
Now I am loading them:
import pickle

with open('dict1.txt', "rb") as myFile:
    dict1 = pickle.load(myFile)
However, already while loading the first dictionary I get:
python(3716,0xa08ed1d4) malloc: *** mach_vm_map(size=1048576) failed (error code=3)
*** error: can't allocate region securely
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1019, in load_empty_dictionary
    self.stack.append({})
MemoryError
How can I solve this? My computer has 16GB of RAM, so I find it unusual that loading an 800MB dictionary crashes. What I also find unusual is that there were no problems while saving the dictionaries.
Further, in the future I plan to process more data, resulting in larger dictionaries (3-4GB on disk), so any advice on how to improve efficiency is appreciated.
Recommended Answer
If the data in your dictionaries are numpy arrays, there are packages (such as joblib and klepto) that make pickling large arrays efficient, as both klepto and joblib understand how to use a minimal state representation for a numpy.array. If you don't have array data, my suggestion would be to use klepto to store the dictionary entries in several files (instead of a single file) or in a database, as sketched below.
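As a rough illustration of why several files help (and of what klepto automates for you), here is a minimal stdlib-only sketch that pickles each dictionary entry to its own file. The directory layout and helper names here are my own, not part of the original answer, and it assumes the keys are valid filename stems:

import os
import pickle

def dump_split(d, dirname):
    """Pickle each dictionary entry to its own file, so that loading
    never has to materialize the whole dictionary in one pass."""
    if not os.path.isdir(dirname):
        os.makedirs(dirname)
    for key, value in d.items():
        with open(os.path.join(dirname, str(key) + '.pkl'), 'wb') as f:
            pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)

def load_entry(dirname, key):
    """Load a single entry without unpickling the rest."""
    with open(os.path.join(dirname, str(key) + '.pkl'), 'rb') as f:
        return pickle.load(f)

Loading entries on demand keeps peak memory close to the size of one value rather than the whole dictionary, which is exactly what fails in the traceback above.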
See my answer to a very closely related question at https://stackoverflow.com/a/25244747/2379433 if you are OK with pickling to several files instead of a single file, would like to save/load your data in parallel, or would like to easily experiment with storage formats and backends to see which works best for your case. Also see https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and https://stackoverflow.com/a/24471659/2379433 as well.
As the links above discuss, you could use klepto, which provides the ability to easily store dictionaries to disk or to a database using a common API. klepto also enables you to pick a storage format (pickle, json, etc.); HDF5 (or a SQL database) is another good option, as it allows parallel access. klepto can utilize both specialized pickle formats (like numpy's) and compression (if you care about size rather than speed of access).
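For instance, the basic pattern with klepto's dir_archive backend looks roughly like this (a sketch based on the linked answers; the 'results' directory name and 'some_key' are placeholders):

from klepto.archives import dir_archive

# One file per entry under the 'results' directory; serialized=True
# pickles the values, cached=True keeps an in-memory cache.
db = dir_archive('results', serialized=True, cached=True)
db.update(dict1)   # dict1 is the large dictionary from the question
db.dump()          # flush the cached entries to disk

# Later, in a fresh session: reopen the archive and pull in only
# the entries you actually need, instead of the whole dictionary.
db = dir_archive('results', serialized=True, cached=True)
db.load('some_key')        # load a single entry into the cache
value = db['some_key']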
klepto gives you the option to store the dictionary in an "all-in-one" file or with one file per entry, and it can also leverage multiprocessing or multithreading, meaning that you can save and load dictionary items to/from the backend in parallel. For examples, see the links above.