This article looks at PyTorch and the relation between dynamic computational graphs, padding, and the DataLoader. It should be a useful reference if you have run into the same question - read on for the details.

Problem Description

As far as I understand, the strength of PyTorch is that it works with dynamic computational graphs. In the context of NLP, this means that sequences of variable length do not necessarily need to be padded to the same length. But if I want to use the PyTorch DataLoader, I need to pad my sequences anyway, because the DataLoader only takes tensors - and as a total beginner I don't want to build a custom collate_fn.

Now this makes me wonder - doesn't this wash away the whole advantage of dynamic computational graphs in this context? Also, if I pad my sequences to feed them into the DataLoader as a tensor, with many zeros at the end as padding tokens (in the case of word ids), will that have any negative effect on my training, since PyTorch may not be optimized for computations with padded sequences (the whole premise being that it can handle variable sequence lengths in dynamic graphs), or does it simply not make any difference?

I will also post this question in the PyTorch forum...

Thanks!

Solution

This means that you don't need to pad sequences unless you are doing data batching, which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (described in detail in this paper) that batches graph operations instead of the data, so that might be worth looking into.

But if I want to use the PyTorch DataLoader, I need to pad my sequences anyway, because the DataLoader only takes tensors - and as a total beginner I don't want to build a custom collate_fn.

You can use the DataLoader as long as you write your own Dataset class and use batch_size=1. The twist is to use numpy arrays for your variable-length sequences (otherwise default_collate will give you a hard time):

import numpy as np
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

class FooDataset(Dataset):
    def __init__(self, data, target):
        assert len(data) == len(target)
        self.data = data
        self.target = target
    def __getitem__(self, index):
        return self.data[index], self.target[index]
    def __len__(self):
        return len(self.data)

data = [[1,2,3], [4,5,6,7,8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']

ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)

print(list(enumerate(dl)))
# [(0, [
#  1  2  3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
#  4  5  6  7  8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
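
For completeness: if at some point you do want batches larger than 1, a padding collate_fn is only a few lines. The sketch below is not part of the original answer - the name pad_and_collate is made up, and it assumes the sequences are integer word ids that can be zero-padded with torch.nn.utils.rnn.pad_sequence:

import torch
from torch.utils.data.dataloader import DataLoader
from torch.nn.utils.rnn import pad_sequence

def pad_and_collate(batch):
    # batch is a list of (sequence, label) pairs as returned by FooDataset
    sequences, labels = zip(*batch)
    sequences = [torch.as_tensor(s, dtype=torch.long) for s in sequences]
    lengths = torch.tensor([len(s) for s in sequences])
    # zero-pad every sequence in the batch up to the longest one
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    return padded, lengths, list(labels)

# reusing the FooDataset instance from above
dl = DataLoader(ds, batch_size=2, collate_fn=pad_and_collate)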

Now this makes me wonder - doesn't this wash away the whole advantage of dynamic computational graphs in this context?

Fair point, but the main strength of dynamic computational graphs is (at least currently) the possibility of using debugging tools such as pdb, which rapidly cuts down your development time. Debugging is much harder with static computation graphs. There is also no reason why PyTorch could not implement further just-in-time optimizations or a concept similar to DyNet's autobatching in the future.
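
To make that concrete, a dynamic graph lets you drop a breakpoint straight into forward() and inspect live tensors while the graph is being built. A tiny made-up example, not from the original answer:

import pdb
import torch
import torch.nn as nn

class DebuggableModel(nn.Module):
    def forward(self, x):
        h = x * 2
        pdb.set_trace()  # stops here on every forward pass; inspect x, h, shapes, requires_grad, ...
        return h.sum()

# DebuggableModel()(torch.randn(3, 4))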

Also, if I pad my sequences to feed them into the DataLoader as a tensor, with many zeros at the end as padding tokens [...], will that have any negative effect on my training [...]?

Yes, both at runtime and in the gradients. The RNN will iterate over the padding just like normal data, which means you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pack_padded_sequence and pad_packed_sequence. These let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (at least not unless you add support for it manually).
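
Not part of the original answer, but a minimal sketch of how those two functions are typically wrapped around a built-in RNN; the toy batch, vocabulary size, and hidden size are made up:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of word ids, zero-padded to the longest length (5).
padded = torch.tensor([[4, 5, 6, 7, 8],
                       [1, 2, 3, 0, 0]])   # (batch, max_len)
lengths = torch.tensor([5, 3])             # true lengths, longest sequence first

embedding = nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

embedded = embedding(padded)                                        # (batch, max_len, 8)
packed = pack_padded_sequence(embedded, lengths, batch_first=True)  # padding is dropped here
packed_output, (h_n, c_n) = rnn(packed)                             # the RNN never sees the pad steps
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)
print(output.shape)  # torch.Size([2, 5, 16])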


This concludes the article on PyTorch and the relation between dynamic computational graphs, padding, and the DataLoader. We hope the answer above is helpful - thanks for reading and for your support!
