Question
I am attempting to tokenize a single column in a TensorFlow Dataset. The approach I've been using works well if there is only a single feature column, for example:
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
from collections import Counter

text = ["I played it a while but it was alright. The steam was a bit of trouble."
        " The more they move these game to steam the more of a hard time I have"
        " activating and playing a game. But in spite of that it was fun, I "
        "liked it. Now I am looking forward to anno 2205 I really want to "
        "play my way to the moon.",
        "This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
df = pd.DataFrame({"text": text,
                   "target": target})

training_dataset = (
    tf.data.Dataset.from_tensor_slices((
        tf.cast(df.text.values, tf.string),
        tf.cast(df.target, tf.int32))))

tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for text, _ in training_dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)
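The vocabulary-truncation step at the end of that snippet is plain Python, independent of TensorFlow. A minimal stdlib sketch of the same top-k idea (the words and counts below are made up for illustration):

```python
from collections import Counter

# toy counts standing in for the Counter built while iterating the dataset
vocabulary = Counter({"the": 10, "game": 5, "steam": 3, "moon": 1})

vocab_size = 2
# most_common(k) returns (word, count) pairs sorted by count, descending;
# zip(*...) splits them into a tuple of words and a tuple of counts
top_words, counts = zip(*vocabulary.most_common(vocab_size))
print(top_words)  # ('the', 'game')
```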
However, when I try to do this where there is a set of feature columns, say coming out of make_csv_dataset (where each feature column is named), the above methodology fails with: ValueError: Attempt to convert a value (OrderedDict([])) to a Tensor.
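This is because make_csv_dataset yields each element as a dict of features keyed by column name (plus the label), not as a single tensor. A rough standard-library analogue of that behavior, with no TensorFlow involved (the CSV contents below are made up for illustration):

```python
import csv
import io

# Like make_csv_dataset, csv.DictReader yields one mapping per row keyed
# by column name, so you must select a column rather than treat the whole
# row as a single value.
raw = "text,target,gender,age\nhello world,0,1,45\ngreat game,1,0,35\n"

texts = []
for row in csv.DictReader(io.StringIO(raw)):
    texts.append(row["text"])  # pick out the single column you need
print(texts)  # ['hello world', 'great game']
```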
I attempted to reference a specific feature column within the for loop using:
text = ["I played it a while but it was alright. The steam was a bit of trouble."
        " The more they move these game to steam the more of a hard time I have"
        " activating and playing a game. But in spite of that it was fun, I "
        "liked it. Now I am looking forward to anno 2205 I really want to "
        "play my way to the moon.",
        "This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
gender = [1, 0]
age = [45, 35]
df = pd.DataFrame({"text": text,
                   "target": target,
                   "gender": gender,
                   "age": age})
df.to_csv('test.csv', index=False)

dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target')

tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for features, _ in dataset:
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)
I get the error: Expected binary or unicode string, got array([]). What is the proper way to reference a single feature column so that I can tokenize? Typically you can reference a feature column with the features['column_name'] approach inside a .map function, for example:
def new_age_func(features, target):
    age = features['age']
    features['age'] = age/2
    return features, target

dataset = dataset.map(new_age_func)

for features, target in dataset.take(2):
    print('Features: {}, Target {}'.format(features, target))
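The (features, target) dict-in, dict-out pattern that dataset.map relies on can be sketched in plain Python without TensorFlow. This toy version mirrors new_age_func above (the sample feature values are made up):

```python
# Each element is a (features, target) pair where features maps
# column names to values, just as make_csv_dataset produces.
def new_age_func(features, target):
    features = dict(features)               # copy so the input is untouched
    features["age"] = features["age"] / 2   # transform one named column
    return features, target

samples = [({"age": 45, "gender": 1}, 0),
           ({"age": 35, "gender": 0}, 1)]
# list comprehension standing in for dataset.map(new_age_func)
mapped = [new_age_func(f, t) for f, t in samples]
print(mapped[0][0]["age"])  # 22.5
```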
I tried combining approaches and generating the vocabulary list via a map function.
tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()

def vocab_generator(features, target):
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

dataset = dataset.map(vocab_generator)
but this results in the error:
AttributeError: in user code:

    <ipython-input-61-374e4c375b58>:10 vocab_generator  *
        tokens = tokenizer.tokenize(text.numpy())

    AttributeError: 'Tensor' object has no attribute 'numpy'
and changing tokenizer.tokenize(text.numpy()) to tokenizer.tokenize(text) throws another error: TypeError: Expected binary or unicode string, got <tf.Tensor 'StringLower:0' shape=(2,) dtype=string>
Answer
The error is just that tokenizer.tokenize expects a single string and you're giving it a list. This simple edit will work: instead of passing the tokenizer a list of strings, loop over the batch and pass it one string at a time.
dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target',
    num_epochs=1)

tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for features, _ in dataset:
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    for t in text:
        tokens = tokenizer.tokenize(t.numpy())
        vocabulary.update(tokens)
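The fix boils down to: a batch is a sequence of strings, so tokenize each element individually. A stdlib-only sketch of the same idea, where toy_tokenize is a made-up stand-in for tfds's Tokenizer (the batch contents are made up too):

```python
import re
from collections import Counter

def toy_tokenize(s):
    # stand-in for tokenizer.tokenize: accepts ONE string, not a batch
    if not isinstance(s, str):
        raise TypeError("Expected a string, got %r" % type(s).__name__)
    return re.findall(r"[a-z0-9']+", s.lower())

batch = ["This game is great.", "this game is hard"]  # like batch_size=2

vocabulary = Counter()
for t in batch:                      # tokenize each string separately,
    vocabulary.update(toy_tokenize(t))  # never the whole batch at once
print(vocabulary["game"])  # 2
```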