I'm trying to train a small network to get familiar with TensorFlow 2.0. However, TensorFlow does not seem to work properly on my machine.
Here is my code:
import tensorflow as tf
from functools import reduce
from tensorflow.keras import layers, Sequential, datasets
import numpy as np
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0
class Model():
    weights = []
    biases = []

    def weights_collect(self):
        # Collect kernels and biases from the layers that have them
        for l in self.layers:
            try:
                self.weights.append(l.kernel)
                self.biases.append(l.bias)
            except:
                pass

    def __init__(self):
        self.layers = [
            layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
            layers.MaxPool2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation='relu'),
            layers.MaxPool2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation='relu'),
            layers.MaxPool2D((2, 2)),
            layers.Flatten(),
            layers.Dense(1024, activation='relu'),
            layers.Dense(10)
        ]
        self.model = Sequential(self.layers)
        self.weights_collect()

    @tf.function
    def predict_logits(self, X):
        return self.model(X)

    @tf.function
    def __call__(self, X):
        return tf.nn.softmax(self.model(X))

@tf.function
def loss(m:Model, x:np.ndarray, t:np.ndarray):
    logits = m.predict_logits(x)
    tar = tf.one_hot(t, 10)
    return tf.math.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(tar, logits))

@tf.function
def acc(m, X, T):
    logits = m.predict_logits(X)
    target = tf.reshape(tf.dtypes.cast(T, tf.dtypes.int32), [-1])
    pred = tf.math.argmax(logits, axis=1, output_type=tf.dtypes.int32)
    return tf.math.reduce_sum(tf.dtypes.cast(pred==target, tf.dtypes.int32))/tf.shape(X)[0]

BATCH_SIZE = 1
dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).batch(BATCH_SIZE)
m = Model()
opt = tf.keras.optimizers.Adam()
EPOCHS = 20
for i in range(EPOCHS):
    for x, t in dataset:
        with tf.GradientTape() as tape:
            loss_value = loss(m, x, t)
        grads = tape.gradient(loss_value, m.model.trainable_weights)
        opt.apply_gradients(zip(grads, m.model.trainable_weights))
    print("hello")
    print(float(loss(m, test_images, test_labels)))
    print(float(acc(m, test_images, test_labels)))
When I run this code, I keep getting this kind of error message:
Allocation of 1228800000 exceeds 10% of system memory.
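After that, my model stops training.
I tried changing the batch size, but it still does not work. The model dies after a few training iterations, even when I set the batch size to 1.
TensorFlow seems to keep allocating system memory during training without ever releasing it.
I also reinstalled my entire system to try to fix this, but it still does not work.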
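Best answer
Problem solved. It appears the latest NVIDIA Game Ready Driver (440.97) was at fault. After rolling back to 436.48, the code keeps training, even though the error message mentioned above still appears.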
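As for the message that still appears after the rollback: its size happens to match the CIFAR-10 training images stored as float64, which suggests it comes from handing the whole `train_images` array to `from_tensor_slices` rather than from a leak. The sketch below only checks that arithmetic with NumPy; `TRAIN_SHAPE` and `array_bytes` are names made up for illustration, and the float32 cast is a possible mitigation I am assuming would help, not something tested on the asker's machine.

```python
import numpy as np

# Shape of the CIFAR-10 training images returned by cifar10.load_data().
TRAIN_SHAPE = (50000, 32, 32, 3)

def array_bytes(shape, dtype):
    """Bytes needed for a dense array of the given shape and dtype."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize

# Dividing a uint8 array by a Python float promotes it to float64.
sample = np.zeros((2, 32, 32, 3), dtype=np.uint8) / 255.0
print(sample.dtype)  # float64

float64_bytes = array_bytes(TRAIN_SHAPE, np.float64)
float32_bytes = array_bytes(TRAIN_SHAPE, np.float32)
print(float64_bytes)  # 1228800000, the size quoted in the message
print(float32_bytes)  # 614400000

# Casting before scaling keeps the data in float32 (half the size).
sample32 = np.zeros((2, 32, 32, 3), dtype=np.uint8).astype(np.float32) / 255.0
print(sample32.dtype)  # float32
```

If that reading is right, the message is a one-off warning about a single large allocation, which would fit the observation that training continues despite it.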