Problem Description
I am currently working on a Jupyter notebook in Kaggle. After performing the desired transformations on my numpy array, I pickled it so that it can be stored on disk. The reason I did that is to free up the memory being consumed by the large array.
The memory consumed after pickling the array was about 8.7 gb.
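For context, the pickling step itself was roughly along these lines (the file name and the placeholder array below are only illustrative):

import pickle
import numpy as np

my_array = np.random.rand(1000, 1000)   # placeholder for the real transformed array

# write the array to disk; 'my_array' as a file name is illustrative
with open('my_array', 'wb') as f:
    pickle.dump(my_array, f, protocol=pickle.HIGHEST_PROTOCOL)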
I decided to run this code snippet provided by @jan-glx here, to find out which variables were consuming my memory:
import sys

def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera, https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key=lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))
After performing this step I noticed that the size of my array was 3.3 gb, and the size of all the other variables summed together was about 0.1 gb.
I decided to delete the array and see if that would fix the problem, by performing the following:
import gc

del my_array
gc.collect()
After doing this, the memory consumption decreased from 8.7 gb to 5.4 gb, which in theory makes sense, but still didn't explain what the rest of the memory was being consumed by.
I decided to continue anyway and reset all my variables with the following, to see whether this would free up the memory:
%reset
As expected it freed up the memory of the variables that were printed out in the function above, and I was still left with 5.3 gb of memory in use.
One thing to note is that I noticed a memory spike when pickling the file itself, so a summary of the process would be something like this:
- Operations performed on the array -> memory consumption increased from about 1.9 gb to 5.6 gb
- Pickling the file -> memory consumption increased from 5.6 gb to about 8.7 gb
- While the file was being pickled there was a sudden spike to 15.2 gb, which then dropped back down to 8.7 gb
- Deleting the array -> memory consumption decreased from 8.7 gb to 5.4 gb
- Performing %reset -> memory consumption decreased from 5.4 gb to 5.3 gb
Please note that the above is loosely based on monitoring the memory on Kaggle and may be inaccurate. I have also checked this question, but it was not helpful for my case.
Would this be considered a memory leak? If so, what do I do in this case?
After some further digging, I noticed that there are others facing this problem. The problem stems from the pickling process: pickling creates a copy in memory but, for some reason, does not release it. Is there a way to release the memory after the pickling process is complete?
When I deleted the pickled file from disk, using:
!rm my_array
it ended up freeing the disk space and freeing up memory as well. I don't know whether the above tidbit is of use or not, but I decided to include it anyway, as every bit of info might help.
Recommended Answer
There is one basic drawback that you should be aware of: the CPython interpreter can actually barely free memory and return it to the OS. For most workloads, you can assume that memory is not freed during the lifetime of the interpreter's process. However, the interpreter can re-use the memory internally. So looking at the memory consumption of the CPython process from the operating system's perspective really does not help at all. A rather common work-around is to run memory-intensive jobs in a sub-process / worker process (via multiprocessing for instance) and "only" return the result to the main process. Once the worker dies, the memory is actually freed.
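A minimal sketch of that workaround, assuming the heavy work can be wrapped in a function (the function name, file name and array shape below are placeholders):

import pickle
from multiprocessing import Process

import numpy as np

def build_and_pickle(path):
    # all heavy allocations live inside the worker process
    big = np.random.rand(5000, 5000)   # placeholder for your actual transformations
    with open(path, 'wb') as f:
        pickle.dump(big, f, protocol=pickle.HIGHEST_PROTOCOL)

if __name__ == '__main__':
    p = Process(target=build_and_pickle, args=('my_array',))
    p.start()
    p.join()   # when the worker exits, its memory is returned to the OS

The main process never holds the big array (or the temporary copy created while pickling), so nothing lingers in the interpreter's heap afterwards.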
Second, using sys.getsizeof on ndarrays can be impressively misleading. Use the ndarray.nbytes property instead, and be aware that this may also be misleading when dealing with views.
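To illustrate the difference (a small sketch, not your actual array):

import sys
import numpy as np

a = np.zeros((1000, 1000))    # 8,000,000 bytes of float64 data
view = a[::2]                 # a view sharing the same buffer

print(sys.getsizeof(a))       # header plus data, since this array owns its buffer
print(a.nbytes)               # 8000000
print(sys.getsizeof(view))    # only the small ndarray header; the data is not counted
print(view.nbytes)            # 4000000, even though no extra memory was allocated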
Besides, I am not entirely sure why you "pickle" numpy arrays. There are better tools for this job. Just to name two: h5py (a classic, based on HDF5) and zarr. Both libraries allow you to work with ndarray-like objects directly on disk (and with compression), essentially eliminating the pickling step. In addition, zarr also allows you to create compressed ndarray-compatible data structures in memory. Most ufuncs from numpy, scipy & friends will happily accept them as input parameters.
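For example, with h5py (a minimal sketch; the file and dataset names are arbitrary):

import h5py
import numpy as np

arr = np.random.rand(1000, 1000)   # placeholder for the transformed array

# write to disk with compression, no pickling involved
with h5py.File('my_array.h5', 'w') as f:
    f.create_dataset('data', data=arr, compression='gzip')

del arr   # the in-memory copy can be dropped right away

# later, read back only the part you actually need
with h5py.File('my_array.h5', 'r') as f:
    first_rows = f['data'][:100]   # loads just 100 rows into memory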