Output differences when changing the order of batch(), shuffle() and repeat()

Question

I created a TensorFlow dataset, made it repeatable, shuffled it, divided it into batches, and constructed an iterator to get the next batch. But when I do this, elements are sometimes repeated (within and across batches), especially for small datasets. Why?

Recommended answer

Unlike what is stated in your own answer, no, shuffling and then repeating won't fix your problem.

The key source of your problem is that you batch, then shuffle/repeat. That way, the items in your batches will always be taken from contiguous samples in the input dataset. Batching should be one of the last operations you do in your input pipeline.
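To make the point about contiguous samples concrete, here is a minimal sketch assuming TensorFlow 2.x with eager execution. The toy dataset `Dataset.range(10)`, the buffer sizes and the variable names are illustrative choices, not taken from the question.

```python
import tensorflow as tf

# Toy dataset with elements 0..9 (illustrative only).
ds = tf.data.Dataset.range(10)

# Batch first, then shuffle: shuffle() now sees two elements, each a whole
# batch, so only the order of the batches changes. Their contents stay
# contiguous: [0..4] and [5..9].
batch_then_shuffle = ds.batch(5).shuffle(buffer_size=2)
batches_batch_first = [b.numpy().tolist() for b in batch_then_shuffle]

# Shuffle first, then batch: individual samples are shuffled before they
# are grouped, so a batch can mix samples from anywhere in the dataset.
shuffle_then_batch = ds.shuffle(buffer_size=10).batch(5)
batches_shuffle_first = [b.numpy().tolist() for b in shuffle_then_batch]

print("batch -> shuffle:", batches_batch_first)
print("shuffle -> batch:", batches_shuffle_first)
```

In the first pipeline every printed batch should be either `[0, 1, 2, 3, 4]` or `[5, 6, 7, 8, 9]`, whatever the random seed; in the second the batch contents vary from run to run.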

Now, the order in which you shuffle, repeat and batch does make a difference, but not the one you think. Quoting from the input pipeline performance guide:

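If the repeat transformation is applied before the shuffle transformation, then the epoch boundaries are blurred. That is, certain elements can be repeated before other elements appear even once. On the other hand, if the shuffle transformation is applied before the repeat transformation, then performance might slow down at the beginning of each epoch related to initialization of the internal state of the shuffle transformation. In other words, the former (repeat before shuffle) provides better performance, while the latter (shuffle before repeat) provides stronger ordering guarantees.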
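The blurred epoch boundaries described in the quote can be observed on a tiny dataset. The following is a sketch under the assumption of TensorFlow 2.x eager execution; the dataset size, the epoch count and the helper `chunks` are hypothetical names chosen for illustration.

```python
import tensorflow as tf

NUM_SAMPLES = 4
NUM_EPOCHS = 3
ds = tf.data.Dataset.range(NUM_SAMPLES)

def chunks(values, size):
    """Split a flat list into consecutive chunks of `size` elements."""
    return [values[i:i + size] for i in range(0, len(values), size)]

# Shuffle, then repeat: each pass over the data is shuffled on its own, so
# every consecutive chunk of NUM_SAMPLES elements is a full permutation.
shuffle_then_repeat = [
    int(x.numpy()) for x in ds.shuffle(NUM_SAMPLES).repeat(NUM_EPOCHS)
]

# Repeat, then shuffle: the shuffle buffer holds elements from several
# passes at once, so an element may appear again before others appear once.
repeat_then_shuffle = [
    int(x.numpy())
    for x in ds.repeat(NUM_EPOCHS).shuffle(NUM_SAMPLES * NUM_EPOCHS)
]

print("shuffle -> repeat:", chunks(shuffle_then_repeat, NUM_SAMPLES))
print("repeat -> shuffle:", chunks(repeat_then_shuffle, NUM_SAMPLES))
```

Both lists contain every element exactly `NUM_EPOCHS` times overall; the difference is only in where the repeats are allowed to fall.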

To recap

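  • Repeat, then shuffle: you lose the guarantee that all samples are processed in one epoch.
  • Shuffle, then repeat: it is guaranteed that all samples are processed before the next repeat begins, but you lose (slightly) in performance.

Whichever you choose, do that before batching.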
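Putting the recap together, one possible pipeline is shuffle, then repeat, then batch. This is a minimal sketch assuming TensorFlow 2.x eager execution; `make_dataset` and its parameters are hypothetical names introduced here, and the buffer size equal to the dataset length is an assumption that only suits data small enough to fit in memory.

```python
import tensorflow as tf

def make_dataset(samples, batch_size, num_epochs):
    """Hypothetical helper: shuffle -> repeat -> batch, in that order."""
    ds = tf.data.Dataset.from_tensor_slices(samples)
    # A buffer as large as the dataset gives a uniform shuffle per epoch.
    ds = ds.shuffle(buffer_size=len(samples))
    ds = ds.repeat(num_epochs)
    # Batching comes last, so each batch is built from shuffled samples.
    return ds.batch(batch_size)

batches = [
    b.numpy().tolist()
    for b in make_dataset(list(range(6)), batch_size=3, num_epochs=2)
]
print(batches)
```

With 6 samples and a batch size of 3, the first two batches cover one full epoch and the last two cover the second. If the dataset size is not a multiple of the batch size, a batch in this sketch can straddle an epoch boundary and so contain the same element twice.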
