Problem description
I'd like to train a classifier on an ImageNet dataset (1000 classes, each with around 1300 images). For some reason, I need each batch to contain 64 images from the same class, and consecutive batches to come from different classes. Is this possible (and efficient) with the latest TensorFlow?
tf.contrib.data.sample_from_datasets in TF 1.9 allows sampling from a list of tf.data.Dataset objects, with weights indicating the probabilities. I wonder if the following idea makes sense:
- Save the data of each class as a separate tfrecord file.
- Pass a tf.data.Dataset.from_generator object as the weights. The object samples from a Categorical distribution such that each sample looks like [0,...,0,1,0,...,0], with 999 0s and 1 1.
- Create 1000 tf.data.Dataset objects, each linked to one tfrecord file.
I thought that, in this way, at each iteration sample_from_datasets might first sample a sparse weight vector that indicates which tf.data.Dataset to sample from, and then sample from that class.
Is this correct? Are there any other efficient ways?
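To make the idea concrete, here is a minimal plain-Python sketch of the sampling logic only; it does not call TensorFlow and says nothing about whether sample_from_datasets accepts such weights. It assumes each per-class source already yields whole batches, which the list above does not state, and all names in it (class_batches, one_hot_class_sampler, sample_batches) are hypothetical.

```python
import random

def class_batches(class_id, batch_size):
    """Stand-in for one per-class dataset: yields batches of (class_id, index) forever."""
    counter = 0
    while True:
        yield [(class_id, counter + i) for i in range(batch_size)]
        counter += batch_size

def one_hot_class_sampler(num_classes, rng):
    """Yield one-hot weight vectors, each selecting a single class uniformly at random."""
    while True:
        chosen = rng.randrange(num_classes)
        yield [1.0 if i == chosen else 0.0 for i in range(num_classes)]

def sample_batches(class_batch_iters, weight_iter):
    """For each one-hot weight vector, draw the next batch from the selected class."""
    for weights in weight_iter:
        class_id = weights.index(1.0)
        yield class_id, next(class_batch_iters[class_id])

# Toy sizes (assumed for illustration): 5 classes, batches of 4.
rng = random.Random(0)
num_classes, batch_size = 5, 4
iters = [class_batches(c, batch_size) for c in range(num_classes)]
stream = sample_batches(iters, one_hot_class_sampler(num_classes, rng))
first_batches = [next(stream) for _ in range(3)]
```

Note that independent draws can select the same class twice in a row, so in this sketch the requirement that consecutive batches come from different classes would need an extra check.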
Update
As kindly suggested by P-Gn, one way to sample data from one class would be:
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(some_parser_fun)  # parse one datum from tfrecord
dataset = dataset.shuffle(buffer_size)
if sample_same_class:
    group_fun = tf.contrib.data.group_by_window(
        key_func=lambda data_x, data_y: data_y,
        reduce_func=lambda key, d: d.batch(batch_size),
        window_size=batch_size)
    dataset = dataset.apply(group_fun)
else:
    dataset = dataset.batch(batch_size)
dataset = dataset.repeat()
data_batch = dataset.make_one_shot_iterator().get_next()
A follow-up question can be found at: How to sample batches from a specific class?
Recommended answer
I don't think your solution could work, if I understand it correctly, because sample_from_datasets expects a list of values for its weights, not a Tensor.
However, if you don't mind having 1000 Datasets as in your proposed solution, then I would suggest simply:
- create one Dataset per class,
- batch each of these datasets, so that each batch has samples from a single class,
- zip all of them into one big Dataset of batches,
- shuffle this Dataset: the shuffling will occur on the batches, not on the samples, so it won't change the fact that batches are single class.
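The order of operations in these steps can be illustrated in plain Python. This is only a sketch of the logic under toy assumptions (three classes, ten string samples each, a hypothetical helper batch_per_class); it does not use tf.data, and combining per-class datasets of batches in TensorFlow would need its own merging step.

```python
import random

def batch_per_class(samples_by_class, batch_size):
    """Split each class's samples into full batches; incomplete trailing batches are dropped."""
    batches = []
    for label, samples in samples_by_class.items():
        for start in range(0, len(samples) - batch_size + 1, batch_size):
            batches.append((label, samples[start:start + batch_size]))
    return batches

# Hypothetical toy data: 3 classes with 10 samples each.
samples_by_class = {
    label: [f"img_{label}_{i}" for i in range(10)] for label in range(3)
}

batches = batch_per_class(samples_by_class, batch_size=4)

# Shuffle whole batches, never individual samples,
# so every batch still contains a single class afterwards.
random.Random(0).shuffle(batches)
```

Because the shuffle permutes (label, batch) pairs as units, the single-class property of each batch is preserved by construction.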
A more sophisticated way is to rely on tf.contrib.data.group_by_window. Let me illustrate that with a synthetic example.
import numpy as np
import tensorflow as tf

def gen():
    while True:
        x = np.random.normal()
        label = np.random.randint(10)
        yield x, label

batch_size = 4
batch = (tf.data.Dataset
    .from_generator(gen, (tf.float32, tf.int64), (tf.TensorShape([]), tf.TensorShape([])))
    .apply(tf.contrib.data.group_by_window(
        key_func=lambda x, label: label,
        reduce_func=lambda key, d: d.batch(batch_size),
        window_size=batch_size))
    .make_one_shot_iterator()
    .get_next())

sess = tf.InteractiveSession()
sess.run(batch)
# (array([ 0.04058843, 0.2843775 , -1.8626076 , 1.1154234 ], dtype=float32),
# array([6, 6, 6, 6], dtype=int64))
sess.run(batch)
# (array([ 1.3600663, 0.5935658, -0.6740045, 1.174328 ], dtype=float32),
# array([3, 3, 3, 3], dtype=int64))
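For intuition about what the grouping does, here is a simplified plain-Python model of the windowing behaviour. It is a sketch, not TensorFlow's implementation: the function name group_by_window_sketch and the toy stream are assumptions, and partial windows left over when the input ends are simply dropped here.

```python
from collections import defaultdict

def group_by_window_sketch(elements, key_func, window_size):
    """Buffer elements per key; emit a window once window_size elements share a key."""
    buffers = defaultdict(list)
    for element in elements:
        key = key_func(element)
        buffers[key].append(element)
        if len(buffers[key]) == window_size:
            # A full single-key window is ready: emit it and clear that buffer.
            yield key, buffers.pop(key)

# Toy stream of (x, label) pairs with two interleaved labels.
stream = [(0.1, 6), (0.2, 3), (0.3, 6), (0.4, 3),
          (0.5, 6), (0.6, 6), (0.7, 3), (0.8, 3)]

windows = list(group_by_window_sketch(stream, key_func=lambda e: e[1], window_size=4))
```

In this model each key keeps at most window_size - 1 elements waiting, so with many classes the buffered data can grow to roughly that many elements per class before windows start to appear.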