I am currently dealing with a big-data problem while training image data with Keras. I have a directory of .npy files in batches. Each batch contains 512 images, and each batch has a corresponding .npy label file, so the directory looks like: {image_file_1.npy, label_file_1.npy, ..., image_file_37.npy, label_file_37.npy}. Each image file has shape (512, 199, 199, 3) and each label file has shape (512, 1) (the labels are 1 or 0). Loading all of the images into a single ndarray would take more than 35 GB. So far I have read through the Keras docs, but I still cannot find how to train with a custom generator. I have read about flow_from_directory
and ImageDataGenerator(...).flow()
, but they are not ideal in this case, or I do not know how to customize them.
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator
val_gen = ImageDataGenerator(rescale=1./255)
x_test = np.load("../data/val_file.npy")
y_test = np.load("../data/val_label.npy")
val_gen.fit(x_test)
model = Sequential()
...
model.add(Dense(512, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

sgd = SGD()  # optimizer instance used below
model.compile(loss='binary_crossentropy',  # labels are 0/1 with a single sigmoid unit
              optimizer=sgd,
              metrics=['acc'])

model.fit_generator(generate_batch_from_directory(),  # should give 1 image file and 1 label file
                    validation_data=val_gen.flow(x_test,
                                                 y_test,
                                                 batch_size=64),
                    validation_steps=32)
So here generate_batch_from_directory()
should pick up image_file_i.npy
together with label_file_i.npy
each time (a rough sketch of what I have in mind is further below) and optimize the weights until no batches are left. Each image array in the .npy files has already been augmented with rotation and scaling, and each .npy
file is properly mixed with data from class 1 and class 0 (50/50). If I append all the batches into one big file, for example:
X_train = np.append([image_file_1, ..., image_file_37])
y_train = np.append([label_file_1, ..., label_file_37])
it does not fit into memory. Otherwise I could use
.flow()
to generate the image sets and train the model. Thanks for any advice.
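For concreteness, this is roughly what I imagine generate_batch_from_directory() doing (only a sketch on my part; the "../data/train/" path is a placeholder, and the endless loop is meant to be paired with steps_per_epoch=37):

import os

def generate_batch_from_directory(data_dir="../data/train/", n_files=37):
    """Sketch: yield one (image_file_i, label_file_i) pair per step,
    looping forever so fit_generator can run multiple epochs."""
    while True:
        for i in range(1, n_files + 1):
            X = np.load(os.path.join(data_dir, "image_file_{}.npy".format(i)))  # (512, 199, 199, 3)
            y = np.load(os.path.join(data_dir, "label_file_{}.npy".format(i)))  # (512, 1)
            yield X, y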
Best Answer
Finally, I was able to solve this problem. I had to read the source code and documentation of keras.utils.Sequence
carefully to build my own generator class (a Sequence subclass has to implement __len__, which returns the number of batches per epoch, and __getitem__, which returns one batch; on_epoch_end is optional). This document helps to understand how generators work in Keras. You can read more details in my kaggle notebook:
import os
import numpy as np
import keras

all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)

image_label_map = {
    "image_file_{}.npy".format(i+1): "label_file_{}.npy".format(i+1)
    for i in range(int(len(all_files)/2))}
partition = [item for item in all_files if "image_file" in item]

class DataGenerator(keras.utils.Sequence):

    def __init__(self, file_list):
        """Constructor can be expanded,
        with batch size, dimension etc.
        """
        self.file_list = file_list
        self.on_epoch_end()

    def __len__(self):
        'Take all batches in each iteration'
        return int(len(self.file_list))

    def __getitem__(self, index):
        'Get next batch'
        # Generate indexes of the batch
        indexes = self.indexes[index:(index + 1)]

        # single file
        file_list_temp = [self.file_list[k] for k in indexes]

        # Set of X_train and y_train
        X, y = self.__data_generation(file_list_temp)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.file_list))

    def __data_generation(self, file_list_temp):
        'Generates data containing batch_size samples'
        data_loc = "datapsycho/imglake/population/train/image_files/"
        # Generate data
        for ID in file_list_temp:
            x_file_path = os.path.join(data_loc, ID)
            y_file_path = os.path.join(data_loc, image_label_map.get(ID))

            # Store sample
            X = np.load(x_file_path)

            # Store class
            y = np.load(y_file_path)

        return X, y
# ====================
# train set
# ====================
all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)

training_generator = DataGenerator(partition)
validation_generator = ValDataGenerator(val_partition)  # works the same as the training generator
hst = model.fit_generator(generator=training_generator,
                          epochs=200,
                          validation_data=validation_generator,
                          use_multiprocessing=True,
                          max_queue_size=32)
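ValDataGenerator and val_partition are built exactly like the training pieces above, just from the validation folder. If you prefer a single class for both sets, one possible variant (NpyBatchGenerator and its constructor arguments are illustrative names, not part of the code above) is to pass the data directory and the label map into the constructor and shuffle the file order between epochs:

class NpyBatchGenerator(keras.utils.Sequence):
    """Sketch: same idea as DataGenerator above, but the data directory and
    label map are constructor arguments, so one class can serve both the
    training and the validation sets."""

    def __init__(self, file_list, data_loc, label_map, shuffle=True):
        self.file_list = file_list
        self.data_loc = data_loc
        self.label_map = label_map
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # one .npy file pair == one batch of 512 images
        return len(self.file_list)

    def __getitem__(self, index):
        file_name = self.file_list[self.indexes[index]]
        X = np.load(os.path.join(self.data_loc, file_name))
        y = np.load(os.path.join(self.data_loc, self.label_map[file_name]))
        return X, y

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.file_list))
        if self.shuffle:
            np.random.shuffle(self.indexes)  # new file order every epoch

# hypothetical usage; the validation names are assumed, not defined above
training_generator = NpyBatchGenerator(partition, all_files_loc, image_label_map)
# validation_generator = NpyBatchGenerator(val_partition, val_files_loc, val_label_map)

keras.utils.Sequence is also the safer option when use_multiprocessing=True is set, because Keras guarantees each batch of a Sequence is used only once per epoch, which a plain Python generator cannot promise.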