DataFrame re-indexed object unnecessarily kept in memory

Problem Description

In continuation from this question, I've implemented two functions that do the same thing; one uses re-indexing and the other does not. The functions differ only in their third line:

def update(centroid):
    best_mean_dist = 200
    clust_members = members_by_centeriod[centroid]
    for member in clust_members:
        # chained .ix lookups against the full frame (old pandas indexer)
        member_mean_dist = 100 - df.ix[member].ix[clust_members].score.mean()

        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist

def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    # build a small sub-frame once, restricted to the cluster's members
    # on both MultiIndex levels, then do every lookup against it
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()

        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
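
To make the two-level reindex concrete, here is a minimal, self-contained sketch on toy data (the names idx, members, and sub are illustrative, and .loc stands in for the question's old-style .ix so the snippet also runs on current pandas):

import numpy as np
import pandas as pd

# toy frame with a 2-level MultiIndex of member ids and a 'score' column
idx = pd.MultiIndex.from_product([[0, 1, 2, 3], [0, 1, 2, 3]], names=["a", "b"])
df = pd.DataFrame({"score": np.arange(16.0)}, index=idx)

members = [1, 3]
# keep only rows whose labels on BOTH levels are in `members`;
# this is what update1's third line does before its inner loop
sub = df.reindex(members, level=0).reindex(members, level=1)

# the inner loop then reads one member's block from the small frame
print(100 - sub.loc[1].loc[members].score.mean())   # 94.0 for this toy data

The point of the copy is that every subsequent lookup touches only the small sub-frame instead of the full multi-million-row one.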

The functions are being called from an IPython notebook cell:

results = [update(centroid) for centroid in centroids]   # one (centroid, best_mean_dist) per cluster

The dataframe df is large, with around 4 million rows, and takes ~300 MB in memory.

The update1 function that uses re-indexing is much faster. But something unexpected happens: after just a few iterations of the re-indexing version, memory quickly climbs from ~300 MB to 1.5 GB, and then I get a memory violation.

The update function does not suffer from this kind of behavior. There are two things I don't get:


  1. Re-indexing makes a copy; that is obvious. But isn't that copy supposed to die each time the update1 function finishes? The new_df variable should die with the function that created it, right?

  2. Even if the garbage collector is not killing new_df right away, once memory runs out it should reclaim it rather than raise an OutOfMemory exception, right?

I tried killing new_df manually by adding del new_df at the end of the update1 function; that didn't help. Might that indicate that the bug is actually in the re-indexing process itself?
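
For reference, here is a sketch of that attempt as a hypothetical update1_del variant, reusing df and members_by_centeriod from the question (the explicit gc.collect() call is an extra step beyond the plain del described above):

import gc

def update1_del(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    # drop the local copy and force a collection pass; the memory still grows,
    # because (as the answer below shows) the retained objects are caches
    # hanging off df.index, not the new_df copy itself
    del new_df
    gc.collect()
    return centroid, best_mean_dist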

Edit:

I found the problem, but I can't understand the reason for this behavior: it is the Python garbage collector refusing to clean up the reindexed dataframe. This is valid:

for i in range(2000):
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)

This is also valid:

def reindex():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    score = 100 - new_df.ix[member].ix[clust_members].score.mean()
    return score

for i in range(2000):
    reindex()

But this causes the reindexed objects to be retained in memory:

z = []
for i in range(2000):
    z.append(reindex())

I think my usage is naively correct. How does the new_df variable stay connected to the score value, and why?

Recommended Answer

Here is my debug code. When you do indexing, the Index object will create _tuples and an engine map; I think the memory is used by these two cache objects. If I add the lines marked by ****, then the memory increase is very small, about 6 MB on my PC:

import pandas as pd
print pd.__version__          # Python 2 syntax; this was written against old pandas
import numpy as np
import psutil
import os
import gc

def get_memory():
    # resident set size of the current process (old psutil API)
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.get_memory_info().rss

def get_object_ids():
    # snapshot of every object currently tracked by the garbage collector
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

# build a 4-million-row frame with a 2-level MultiIndex, similar to the question's
n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a","b","c"])

ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
df.index._tuples = None    # **** clear the cached tuple form of the MultiIndex
df.index._cleanup()        # **** drop the cached index engine
del df2
gc.collect()               # ****
m3 = get_memory()

print (m2-m1)/1e6, (m3-m2)/1e6

from collections import Counter

# count, by type, the objects that appeared since the first snapshot and survived
counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print counter
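
Applied to the question's workflow, the same cleanup can be wrapped around each reindex. The sketch below assumes the old-pandas internals used above (reindex_with_cleanup is a hypothetical helper name, and _tuples/_cleanup() are private attributes that may not exist in other pandas versions):

import gc

def reindex_with_cleanup(frame, members):
    # two-level reindex, exactly as update1 does
    sub = frame.reindex(members, level=0).reindex(members, level=1)
    # drop the caches that the lookup machinery left on the ORIGINAL index
    frame.index._tuples = None   # cached tuple form of the MultiIndex (private)
    frame.index._cleanup()       # cached index engine (private)
    gc.collect()
    return sub

update1 could then call new_df = reindex_with_cleanup(df, members_in_clust) instead of the bare double reindex, keeping memory flat at the cost of rebuilding the caches on the next lookup.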
