Converting a list of unequally shaped arrays to a TensorFlow 2 Dataset: ValueError: Can't convert non-rectangular Python sequence to Tensor

Question

I have tokenized data in the form of a list of unequally shaped arrays:

array([array([1179,    6,  208,    2, 1625,   92,    9, 3870,    3, 2136,  435,
          5, 2453, 2180,   44,    1,  226,  166,    3, 4409,   49, 6728,
         ...
         10,   17, 1396,  106, 8002, 7968,  111,   33, 1130,   60,  181,
       7988, 7974, 7970])], dtype=object)

with their respective targets:

Out[74]: array([0, 0, 0, ..., 0, 0, 1], dtype=object)

I'm trying to turn them into a padded tf.data.Dataset, but TensorFlow won't let me convert the unequally shaped arrays to a tensor. I get this error:

ValueError: Can't convert non-rectangular Python sequence to Tensor.

Here is the full code. Assume my starting point is just after the line y = ...:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

(train_data, test_data) = tfds.load('imdb_reviews/subwords8k',
                                    split=(tfds.Split.TRAIN, tfds.Split.TEST),
                                    as_supervised=True)

x = np.array(list(train_data.as_numpy_iterator()))[:, 0]
y = np.array(list(train_data.as_numpy_iterator()))[:, 1]


train_tensor = tf.data.Dataset.from_tensor_slices((x.tolist(), y))\
    .padded_batch(batch_size=8, padded_shapes=([None], ()))

What are my options for turning this into a padded batch tensor?

Recommended answer

If your data is stored in NumPy arrays or Python lists, you can create the dataset with the tf.data.Dataset.from_generator method and then pad the batches:

train_batches = tf.data.Dataset.from_generator(
    lambda: iter(zip(x, y)), 
    output_types=(tf.int64, tf.int64)
).padded_batch(
    batch_size=32,
    padded_shapes=([None], ())
)
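To make the generator approach easy to try without downloading IMDB, here is a minimal sketch on toy data. The arrays, labels, and the helper name make_padded_dataset are made up for illustration, and the sketch assumes TensorFlow 2.4 or later, where from_generator accepts output_signature in place of output_types:

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data standing in for the tokenized reviews and labels.
x = [np.array([1, 2, 3], dtype=np.int64),
     np.array([4, 5], dtype=np.int64),
     np.array([6], dtype=np.int64)]
y = [0, 1, 0]


def make_padded_dataset(sequences, labels, batch_size):
    """Build a padded, batched dataset from variable-length sequences."""
    def gen():
        # Yield one (sequence, label) pair at a time, so TensorFlow never
        # has to build a single rectangular tensor from the ragged data.
        for seq, label in zip(sequences, labels):
            yield seq, label

    dataset = tf.data.Dataset.from_generator(
        gen,
        output_signature=(
            tf.TensorSpec(shape=(None,), dtype=tf.int64),  # variable length
            tf.TensorSpec(shape=(), dtype=tf.int64),       # scalar label
        ),
    )
    # Pad each batch of sequences up to the longest sequence in that batch.
    return dataset.padded_batch(batch_size, padded_shapes=([None], []))


batches = list(make_padded_dataset(x, y, batch_size=2).as_numpy_iterator())
for features, labels in batches:
    print(features.tolist(), labels.tolist())
```

Because the elements are produced one by one, the unequal lengths only have to be reconciled inside padded_batch, which is where the padding happens.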

However, if you load the data with tensorflow_datasets.load, there is no need to use as_numpy_iterator to separate the data from the labels and then put them back together in a dataset. That is redundant and inefficient. The objects returned by tensorflow_datasets.load are already instances of tf.data.Dataset, so you only need to call padded_batch on them:

train_batches = train_data.padded_batch(batch_size=32, padded_shapes=([None], []))
test_batches = test_data.padded_batch(batch_size=32, padded_shapes=([None], []))

Note that in TensorFlow 2.2 and above, you no longer need to provide the padded_shapes argument if you just want every axis padded to the longest one in the batch (the default behavior).
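To illustrate what "padded to the longest of the batch" means, here is a plain-Python sketch of the idea. It is not TensorFlow's implementation; the function names pad_batch and padded_batches and the pad value 0 are assumptions chosen for the example:

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]


def padded_batches(sequences, batch_size, pad_value=0):
    """Yield consecutive batches, each padded to its own longest sequence."""
    for start in range(0, len(sequences), batch_size):
        yield pad_batch(sequences[start:start + batch_size], pad_value)


batches = list(padded_batches([[1, 2, 3], [4, 5], [6], [7, 8]], batch_size=2))
print(batches)  # [[[1, 2, 3], [4, 5, 0]], [[6, 0], [7, 8]]]
```

The padded length is decided per batch rather than for the whole dataset, so the first batch here has width 3 and the second has width 2.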
