【一】Declaration
This article is based on the official TensorFlow tutorial (https://tensorflow.google.cn/tutorials/sequences/text_generation), with some additional explanatory details.
【二】Overview
1. tf.keras differs from keras in three notable ways:
1): The optimizer must come from the tf.train module; optimizers from keras cannot be used.
2): The default save format of a tf.keras model is the TensorFlow checkpoint format, not h5.
3): When training or running inference with tf.keras, input_data can be a tf.data.Dataset passed in directly.
2. The official TensorFlow example generates text character by character: given a sequence, the model predicts the next character. Because it is character-level, the model does not know how words are spelled, nor that characters form words; it only knows how to predict the next character. As a result, it may generate words or phrases that do not exist.
3. The model has only three layers (char-embedding, GRU, FC), but it has a very large number of parameters and trains slowly (roughly half an hour per epoch on an i7 CPU). Moreover, the char-embedding here is trained from scratch together with the model, rather than being pre-trained with fasttext or gensim and then fine-tuned; a sketch of that alternative follows.
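For comparison only, here is a minimal sketch of that fine-tuning approach. It assumes pretrained character vectors are available as a gensim KeyedVectors object named char_vectors (the name and the snippet are illustrative assumptions, not part of the tutorial); vocab_size, embedding_dim and char2idx refer to the variables defined in the code below.

import numpy as np
# Build an initial weight matrix from the (assumed) pretrained character vectors
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for char, idx in char2idx.items():
    if char in char_vectors:  # char_vectors: hypothetical gensim KeyedVectors
        embedding_matrix[idx] = char_vectors[char]
# Initialize the Embedding layer from this matrix and keep it trainable for fine-tuning
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size, output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=True)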
【三】The code is as follows:
# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np
import os
import time
tf.enable_eager_execution()
# 1. Download the data
path = tf.keras.utils.get_file('shakespeare.txt',
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
# 2. Preprocess the data
with open(path) as f:
    # text is a single string
    text = f.read()
# 3. Extract every distinct character in the text; note that vocab is a list
vocab = sorted(set(text))
# 4. Build the char --> int mapping
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])
# 5. Use the Dataset batch method to split the text into fixed-length sequences
seq_length = 100
examples_per_epoch = len(text)//seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
# seq_length is incremented by 1 because each chunk is later split into inputs and labels
# that are shifted by one character, so one extra character is needed per chunk
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
# 6. Split each sequence into inputs and labels. For example, for "hello": inputs = "hell", labels = "ello"
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)
# 7. Group the sequences into batches
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE
BUFFER_SIZE = 10000
# drop_remainder should usually be set to True: when the remaining data is not enough to fill a batch, it is dropped
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
# 8. Build the model
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
model = tf.keras.Sequential()
# This is a character embedding, so its weight matrix is vocab_size * embedding_dim
model.add(tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                                    batch_input_shape=[BATCH_SIZE, None]))
model.add(tf.keras.layers.GRU(units=rnn_units,
                              return_sequences=True,
                              recurrent_initializer='glorot_uniform',
                              stateful=True))
model.add(tf.keras.layers.Dense(units=vocab_size))
model.summary()
# 9. Configure the model
# The optimizer must come from tf.train; a keras optimizer cannot be used here
model.compile(optimizer=tf.train.AdamOptimizer(), loss=tf.losses.sparse_softmax_cross_entropy)
# 10. Set up the callbacks
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs/')
# 11. Train the model. repeat() makes the dataset loop indefinitely; otherwise there may not be enough data for 30 epochs
model.fit(dataset.repeat(), epochs=30,
          steps_per_epoch=steps_per_epoch,
          callbacks=[checkpoint_callback, tensorboard_callback])
# 12. Save the model
# Save the weights in Keras (HDF5) format
model.save_weights(filepath='./models/gen_text_with_char_on_rnn.h5', save_format='h5')
# Save the weights in TensorFlow checkpoint format
model.save_weights(filepath='./models/gen_text_with_char_on_rnn_check_point')
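# Rebuild the model with batch size 1 for text generation (a sketch following the
# approach used in the official tutorial, added here because the training model above
# is stateful and was built with batch_input_shape=[BATCH_SIZE, None], so it cannot
# consume a single start string at inference time).
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                                    batch_input_shape=[1, None]))
model.add(tf.keras.layers.GRU(units=rnn_units,
                              return_sequences=True,
                              recurrent_initializer='glorot_uniform',
                              stateful=True))
model.add(tf.keras.layers.Dense(units=vocab_size))
# Restore the trained weights saved above
model.load_weights('./models/gen_text_with_char_on_rnn_check_point')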
# 13. Generate text with the model
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)
    # Number of characters to generate
    num_generate = 1000
    # You can change the start string (passed as an argument) to experiment
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    # Empty string to store our results
    text_generated = []
    # Low temperature results in more predictable text.
    # Higher temperature results in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0
    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        # use a multinomial distribution to sample the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()
        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))
print(generate_text(model, start_string="ROMEO: "))
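Note that step 12 saves only the weights, not the full model, so reusing them in a new session requires rebuilding the same architecture first and then calling load_weights. A minimal sketch, assuming the model from step 8 (or the batch-size-1 variant above) has been reconstructed:

# Restore from the HDF5 weight file ...
model.load_weights('./models/gen_text_with_char_on_rnn.h5')
# ... or from the TensorFlow checkpoint saved in step 12
model.load_weights('./models/gen_text_with_char_on_rnn_check_point')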
【四】Summary
1. For more on tf.keras, see the official guide (https://tensorflow.google.cn/guide/keras).
2. For more on tf.data, see the official guide (https://tensorflow.google.cn/guide/datasets) and another blog post (https://my.oschina.net/u/3800567/blog/1637798).
3. You can use tf.keras exclusively and drop keras altogether. The two offer essentially the same functionality and API, while tf.keras provides better integration with TensorFlow.