Problem description
In continuation from this question, I've implemented two functions that do the same thing; one uses re-indexing and the other does not. The functions differ in the 3rd line:
def update(centroid):
    best_mean_dist = 200
    clust_members = members_by_centeriod[centroid]
    for member in clust_members:
        member_mean_dist = 100 - df.ix[member].ix[clust_members].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    # the extra 3rd line: pre-select the cluster's rows via re-indexing
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
The functions are being called from an IPython notebook cell:
centroids = [update(centroid) for centroid in centroids]
The dataframe df is large, with around 4 million rows, and takes ~300MB in memory.
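For reference, a minimal sketch of what such a frame could look like, purely my reconstruction from the lookups in the code above; the sizes, names, and the score scale are assumptions:

import numpy as np
import pandas as pd

# Hypothetical reproduction (not from the question): ~4 million rows
# indexed by (member, member) pairs, with a "score" column in the
# 0-100 range that update()/update1() average over.
members = np.arange(2000)
iy, ix = np.indices((members.size, members.size))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
df = pd.DataFrame({"score": 100 * np.random.rand(index.size)}, index=index)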
The update1 function, which uses re-indexing, is much faster. But something unexpected happens: after just a few iterations, memory usage quickly climbs from ~300MB to 1.5GB, and then I get a memory violation.
The update function does not suffer from this behavior. There are two things I'm not getting:
- Re-indexing makes a copy, that much is obvious. But isn't that copy supposed to die each time the update1 function finishes? The new_df variable should die with the function that created it... right? (See the sketch after this list.)
- Even if the garbage collector doesn't kill new_df right away, once memory runs out it should kill it rather than raise an out-of-memory error, right?
I tried killing the copy manually by adding del new_df at the end of the update1 function; that didn't help. So might that indicate the bug is actually in the re-indexing process itself?
Edit:
I found the problem, but I can't understand the reason for this behavior. It is the Python garbage collector refusing to clean up the reindexed dataframe. This is fine:
for i in range(2000):
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
This is also fine:
def reindex():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    score = 100 - new_df.ix[member].ix[clust_members].score.mean()
    return score

for i in range(2000):
    reindex()
But this causes the reindexed objects to be retained in memory:
z = []
for i in range(2000):
    z.append(reindex())
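A quick probe around each of the three loops makes the difference measurable. A sketch assuming Python 3 and a current psutil, whose API is memory_info() rather than the older get_memory_info() used in the answer below:

import os
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    # Resident set size of this process, in MB.
    return proc.memory_info().rss / 1e6

# assumes df, clust_members, member and reindex() from the snippets above
before = rss_mb()
z = []
for i in range(2000):
    z.append(reindex())
print("RSS grew by %.1f MB" % (rss_mb() - before))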
I think my usage is naively correct. How does the new_df variable stay connected to the score value, and why?
Answer
Here is my debug code. When you do indexing, the Index object creates _tuples and an engine map; I think the memory is used by these two cache objects. If I add the lines marked with ****, the memory increase is very small, about 6MB on my PC:
# Python 2-era code: note the print statements and psutil's old
# get_memory_info() API.
import pandas as pd
print pd.__version__
import numpy as np
import psutil
import os
import gc

def get_memory():
    # resident set size of this process, in bytes
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.get_memory_info().rss

def get_object_ids():
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

# build a DataFrame with a 2000x2000 MultiIndex (4 million rows)
n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a", "b", "c"])

ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
    df.index._tuples = None   # ****
    df.index._cleanup()       # ****
    del df2
    gc.collect()              # ****

m3 = get_memory()
print (m2-m1)/1e6, (m3-m2)/1e6

# list the types of the objects created since the snapshot that are
# still alive
from collections import Counter
counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print counter
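Applied back to the original update1, the workaround suggested by the **** lines would look roughly like the sketch below. Note that _tuples and _cleanup() are private pandas internals of that era, so this is version-specific and may not work on later releases:

import gc

def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    # Drop the tuple cache and hash-table engines that the lookups forced
    # the index to build, so they don't accumulate across calls (private API).
    df.index._tuples = None
    df.index._cleanup()
    del new_df
    gc.collect()
    return centroid, best_mean_dist

In the answer's debug loop, adding the equivalent lines kept the growth to about 6MB, so a cleanup along these lines should keep update1's footprint flat as well.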