Problem description
I have a pandas dataframe with two columns: one holds paths to images and the other holds string class labels.
I have also written the following function, which loads the images listed in the dataframe, normalizes them and converts the class labels to one-hot vectors.
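import numpy as np
from keras.utils import to_categorical

# Assumed to be defined elsewhere: imread (an image reader that returns an array)
# and label2int (a dict mapping each string label to an integer).
def prepare_data(df):
    data_X, data_y = df.values[:, 0], df.values[:, 1]
    # Load images
    data_X = np.array([np.array(imread(fname)) for fname in data_X])
    # Normalize input
    data_X = data_X / 255 - 0.5
    # Prepare labels
    data_y = np.array([label2int[label] for label in data_y])
    data_y = to_categorical(data_y)
    return data_X, data_y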
I want to feed this dataframe to a Keras CNN, but the whole dataset is too big to be loaded into memory at once.
Other answers on this site tell me that I should use a Keras ImageDataGenerator for that purpose, but honestly I do not understand from the documentation how to do it.
What is the easiest way of feeding the data to the model in lazily loaded batches?
If it is an ImageDataGenerator, how do I create one that takes the dataframe on initialization and passes the batches through my function to create the appropriate numpy arrays? And how do I fit the model using the ImageDataGenerator?
Recommended answer
ImageDataGenerator is a high-level class that can yield data from multiple sources (numpy arrays, directories...) and that includes utility functions for image augmentation, among other things.
Update
As of keras-preprocessing 1.0.4, ImageDataGenerator comes with a flow_from_dataframe method, which addresses your case. It requires dataframe and directory arguments, defined as follows:
dataframe: Pandas dataframe containing the filenames of the
    images in a column and classes in another or column/s
    that can be fed as raw target data.
directory: string, path to the target directory that contains all
    the images mapped in the dataframe.
So there is no longer any need to implement it yourself.
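As a rough sketch of how this could look for the dataframe in the question: the column names 'image_name' and 'label' are taken from the DataSequence code in the original answer below, while build_generator, normalize, image_dir and target_size are hypothetical names and values chosen for illustration. It also assumes a Keras version that still ships ImageDataGenerator (the class is marked as deprecated in recent TensorFlow releases).

```python
def normalize(image):
    """Scale pixel values from [0, 255] to [-0.5, 0.5], as in prepare_data."""
    return image / 255.0 - 0.5


def build_generator(df, image_dir, batch_size=32, target_size=(224, 224)):
    """Return an iterator that lazily yields (images, one-hot labels) batches.

    Assumptions (not from the original post): image_dir is the folder that
    the paths in df['image_name'] are relative to, and target_size is a
    placeholder for whatever input size the CNN expects.
    """
    # Imported here so that Keras is only required when a generator is built.
    from keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(preprocessing_function=normalize)
    return datagen.flow_from_dataframe(
        dataframe=df,
        directory=image_dir,
        x_col='image_name',        # column holding the image paths
        y_col='label',             # column holding the string class labels
        target_size=target_size,   # images are resized to this shape on load
        class_mode='categorical',  # string labels become one-hot vectors
        batch_size=batch_size,
        shuffle=True,
    )
```

The returned iterator could then be passed to the model in the same way as the Sequence further down, for example model.fit_generator(build_generator(dataframe, 'path/to/images'), epochs=1). With class_mode='categorical' the label-to-integer mapping and the one-hot encoding are handled by the generator, so label2int and to_categorical should not be needed here.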
Original answer below
In your case, with the dataframe as you describe it, you could also write your own custom generator that reuses the logic of your prepare_data function, as a more minimal solution. It is good practice to build it on Keras' Sequence object, since that allows the use of multiprocessing (which helps to avoid bottlenecking your GPU, if you are using one).
You can check out the docs on the Sequence object, which contain an implementation example. In the end, your code would look something along these lines (this is boilerplate code; you will have to add specifics such as your label2int mapping or the image preprocessing logic):
import math
import random

import numpy as np
from keras.utils import Sequence
# imread is assumed to be the same image reader that prepare_data uses.


class DataSequence(Sequence):
    """
    Keras Sequence object to train a model on larger-than-memory data.
    """
    def __init__(self, df, batch_size, mode='train'):
        self.df = df  # your pandas dataframe
        self.bsz = batch_size  # batch size
        self.mode = mode  # shuffle when in train mode
        # Keep the labels and the list of image locations in memory
        self.labels = self.df['label'].values
        self.im_list = self.df['image_name'].tolist()
        # Build the index order once so that it exists before the first epoch
        self.on_epoch_end()

    def __len__(self):
        # Compute the number of batches to yield per epoch
        return int(math.ceil(len(self.df) / float(self.bsz)))

    def on_epoch_end(self):
        # Reshuffle the indexes after each epoch if in training mode
        self.indexes = range(len(self.im_list))
        if self.mode == 'train':
            self.indexes = random.sample(self.indexes, k=len(self.indexes))

    def get_batch_labels(self, idx):
        # Fetch a batch of labels, following the current index order
        batch = list(self.indexes[idx * self.bsz: (idx + 1) * self.bsz])
        return self.labels[batch]

    def get_batch_features(self, idx):
        # Fetch a batch of inputs, following the current index order
        batch = self.indexes[idx * self.bsz: (idx + 1) * self.bsz]
        return np.array([imread(self.im_list[i]) for i in batch])

    def __getitem__(self, idx):
        batch_x = self.get_batch_features(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_x, batch_y
You can pass this object to train your model just like a custom generator:
sequence = DataSequence(dataframe, batch_size)
model.fit_generator(sequence, epochs=1, use_multiprocessing=True)
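To make the batch arithmetic behind __len__ and the slicing in DataSequence concrete, here is a small standalone sketch. It uses no Keras at all, and num_batches and batch_bounds are hypothetical helper names introduced only for this illustration.

```python
import math


def num_batches(n_samples, batch_size):
    """Number of batches per epoch, counting a final partial batch."""
    return int(math.ceil(n_samples / float(batch_size)))


def batch_bounds(idx, n_samples, batch_size):
    """Start (inclusive) and stop (exclusive) sample positions of batch idx."""
    start = idx * batch_size
    stop = min((idx + 1) * batch_size, n_samples)
    return start, stop


# Example: 10 samples with a batch size of 4 give batches of 4, 4 and 2.
n_samples, batch_size = 10, 4
bounds = [batch_bounds(i, n_samples, batch_size)
          for i in range(num_batches(n_samples, batch_size))]
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

Python slices clamp at the end of the list on their own, which is why DataSequence can slice with (idx + 1) * self.bsz directly; the explicit min() here only makes the shorter last batch visible.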
Note that it is not required to implement the shuffling logic yourself. It suffices to set the shuffle argument to True in the fit_generator() call. According to the docs, shuffle controls whether the order of the batches is shuffled at the beginning of each epoch, and it is only used with Sequence instances.
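As a rough pure-Python illustration of what shuffling the order of the batches means (this is not Keras code, and epoch_batch_order is a made-up name): each epoch requests the same batch indices from the Sequence, only in a different order.

```python
import random


def epoch_batch_order(n_batches, shuffle, rng=random):
    """Order in which the batch indices are requested during one epoch."""
    order = list(range(n_batches))
    if shuffle:
        rng.shuffle(order)
    return order


rng = random.Random(0)  # seeded only to make the illustration repeatable
for epoch in range(3):
    # Every epoch visits batches 0..4 exactly once, in a new order.
    print(epoch, epoch_batch_order(5, shuffle=True, rng=rng))
```

Under this kind of shuffling the samples inside a given batch stay together across epochs. If the composition of the batches should change as well, reshuffling the sample indexes in on_epoch_end would still be needed.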