Question
As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences of variable length do not necessarily need to be padded to the same length. However, if I want to use the PyTorch DataLoader, I need to pad my sequences anyway, because the DataLoader only takes tensors and, as a total beginner, I don't want to build a custom collate_fn.
Now this makes me wonder: doesn't this wash away the whole advantage of dynamic computational graphs in this context? Also, if I pad my sequences to feed them into the DataLoader as a tensor, with many zeros as padding tokens at the end (in the case of word ids), will that have any negative effect on my training, since PyTorch may not be optimized for computations with padded sequences (given that the whole premise is that it can work with variable sequence lengths in dynamic graphs), or does it simply not make any difference?
I will also post this question in the PyTorch forum...
Thanks!
Answer
This means that you don't need to pad sequences unless you are doing data batching, which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (described in detail in this paper) that batches the graph operations instead of the data, so that might be what you want to look into.
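For context only (the answer below deliberately avoids this by using batch_size=1): if you did want real batches of variable-length samples, the usual route is a custom collate_fn that pads each batch to the length of its longest sequence. A minimal sketch, assuming each sample is a (1-D LongTensor of token ids, label) pair; the names pad_collate, seqs, and labels are hypothetical:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (sequence_tensor, label) samples
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    # Pad every sequence in this batch to the length of the longest one
    padded = pad_sequence(seqs, batch_first=True, padding_value=0)
    return padded, lengths, labels

# loader = DataLoader(dataset, batch_size=4, collate_fn=pad_collate)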
You can use the DataLoader given that you write your own Dataset class and use batch_size=1. The twist is to use numpy arrays for your variable-length sequences (otherwise default_collate will give you a hard time):
import numpy as np

from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader


class FooDataset(Dataset):
    def __init__(self, data, target):
        assert len(data) == len(target)
        self.data = data
        self.target = target

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)


# Variable-length sequences stored as numpy arrays, one per sample
data = [[1, 2, 3], [4, 5, 6, 7, 8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']

ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)

print(list(enumerate(dl)))
# [(0, [
#  1  2  3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
#  4  5  6  7  8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
Fair point, but the main strength of dynamic computational graphs is (at least currently) the possibility of using debugging tools like pdb, which rapidly decreases your development time. Debugging is way harder with static computation graphs. There is also no reason why PyTorch could not implement further just-in-time optimizations or a concept similar to DyNet's autobatching in the future.
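To illustrate the debugging point, here is a minimal sketch (the model, layer sizes, and names are made up for illustration): because PyTorch executes the graph eagerly, you can drop into pdb in the middle of forward and inspect live tensors.

import pdb

import torch
import torch.nn as nn


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        h = self.linear(x)
        pdb.set_trace()  # pauses here; inspect x, h, their shapes and values interactively
        return torch.relu(h)


TinyModel()(torch.randn(3, 4))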
Yes, both in runtime and for the gradients. The RNN will iterate over the padding just like over normal data, which means that you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pad_packed_sequence and pack_padded_sequence. These will let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (or at least not unless you add support for it manually).
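A minimal sketch of how these two utilities fit together (the GRU, its sizes, and the random input are arbitrary choices for illustration):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two padded sequences of true lengths 5 and 3, batch_first layout
padded = torch.randn(2, 5, 10)        # (batch, max_len, features)
lengths = torch.tensor([5, 3])

rnn = nn.GRU(input_size=10, hidden_size=16, batch_first=True)

# Pack so the RNN skips the padded positions
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
packed_out, h_n = rnn(packed)

# Unpack back to a padded tensor for downstream layers or losses
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)      # torch.Size([2, 5, 16])
print(out_lengths)    # tensor([5, 3])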