Problem description
I noticed this when my grid search for selecting hyper-parameters of a Tensorflow (version 1.12.0) model crashed due to an explosion in memory consumption.
Note that, unlike the similar-looking question here, I do close the graph and the session (using context managers), and I am not adding nodes to the graph in the loop.
I suspected that maybe tensorflow maintains global variables that do not get cleared between iterations, so I called globals() before and after an iteration, but did not observe any difference in the set of global variables before and after each iteration.
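For reference, the check was essentially a before/after comparison along these lines (a sketch; the body of the grid-search iteration is elided):

    # Sketch of the globals() comparison described above.
    globals_before = set(globals().keys())
    # ... run one full grid-search iteration here (build graph, train, close session) ...
    globals_after = set(globals().keys())
    # Apart from 'globals_before' itself, no new module-level names appeared.
    print(globals_after - globals_before)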
I made a small example that reproduces the problem. I train a simple MNIST classifier in a loop and plot the memory consumed by the process:
import matplotlib.pyplot as plt
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import psutil
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

process = psutil.Process(os.getpid())
N_REPS = 100
N_ITER = 10
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
x_test, y_test = mnist.test.images, mnist.test.labels

# Runs the experiment several times.
mem = []
for i in range(N_REPS):
    with tf.Graph().as_default():
        net = tf.contrib.layers.fully_connected(x_test, 200)
        logits = tf.contrib.layers.fully_connected(net, 10, activation_fn=None)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y_test, logits=logits))
        train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)
        init = tf.global_variables_initializer()
        with tf.Session() as sess:
            # Training loop.
            sess.run(init)
            for _ in range(N_ITER):
                sess.run(train_op)
    mem.append(process.memory_info().rss)
plt.plot(range(N_REPS), mem)
The resulting plot (process memory vs. repetition index):
In my actual project, process memory starts at a couple of hundred MB (depending on dataset size) and grows to 64 GB, until my system runs out of memory. I have tried a few things that slow down the increase, such as using placeholders and feed_dicts instead of relying on convert_to_tensor, but the steady growth is still there, only slower.
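For illustration, the placeholder variant of the reproduction script looks roughly like this (a sketch, not my project code; it reuses the imports and the x_test / y_test arrays from the script above, and x_ph / y_ph are names I chose here):

    # Placeholder variant (illustrative): feed the test arrays through
    # placeholders instead of baking them into each new graph as constants.
    for i in range(N_REPS):
        with tf.Graph().as_default():
            x_ph = tf.placeholder(tf.float32, shape=[None, 784])
            y_ph = tf.placeholder(tf.float32, shape=[None, 10])
            net = tf.contrib.layers.fully_connected(x_ph, 200)
            logits = tf.contrib.layers.fully_connected(net, 10, activation_fn=None)
            loss = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(labels=y_ph, logits=logits))
            train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)
            init = tf.global_variables_initializer()
            with tf.Session() as sess:
                sess.run(init)
                for _ in range(N_ITER):
                    sess.run(train_op, feed_dict={x_ph: x_test, y_ph: y_test})
        mem.append(process.memory_info().rss)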
Recommended answer
Try moving the loop inside the session. Don't create the graph and the session for every iteration. Every time the graph is created and the variables are initialized, you are not redefining the old graph but creating a new one, which leads to a memory leak. I was facing a similar issue and was able to solve it by taking the loop inside the session.
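Applied to the reproduction script from the question, the restructuring looks roughly like this (a sketch of the idea, not the exact code I used; it reuses x_test, y_test, process, N_REPS and N_ITER from the script above):

    # Sketch: one graph and one session for all repetitions; only the
    # variables are re-initialized for each run, so no new graphs pile up.
    with tf.Graph().as_default():
        net = tf.contrib.layers.fully_connected(x_test, 200)
        logits = tf.contrib.layers.fully_connected(net, 10, activation_fn=None)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y_test, logits=logits))
        train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)
        init = tf.global_variables_initializer()
        with tf.Session() as sess:
            mem = []
            for i in range(N_REPS):
                sess.run(init)  # reset the weights for this repetition
                for _ in range(N_ITER):
                    sess.run(train_op)
                mem.append(process.memory_info().rss)
    plt.plot(range(N_REPS), mem)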