python - Using tensorflow_hub to embed text with BERT on a Windows machine - extending to ALBERT

I recently posted this question and have been trying to solve the problem. My questions are:

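Is my approach correct?
My example sentences have 7 and 6 words respectively (['New Delhi is the capital of India', 'The capital of India is Delhi']), and even after I add the cls and sep tokens the lengths are only 9 and 8. The max_seq_len parameter is 10, so why are the last rows of x1 and x2 not the same?
How do I embed a paragraph that has more than 2 sentences? Do I have to pass one sentence at a time? In that case, won't I lose information, since the sentences are not all passed together?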

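I did some additional research, and it seems I can pass the whole paragraph as a single sentence by setting segment_ids to 0 for every word in the paragraph. Is that correct?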
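For reference, the single-segment layout described here can be sketched in plain Python. This is only an illustration of the usual BERT input convention (every segment id is 0, and padding positions get a mask of 0), not the code from the original post. The helper name build_single_segment_features, the toy vocabulary, and the whitespace split that stands in for the real WordPiece or SentencePiece tokenizer are all assumptions made for the example.

```python
def build_single_segment_features(tokens, vocab, max_seq_len):
    """Build input_ids, input_mask and segment_ids for one sequence.

    Hypothetical helper: it mirrors the usual BERT input layout but is
    not taken from tokenization.py or from the original post.
    """
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = tokens[: max_seq_len - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]

    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks a real token
    segment_ids = [0] * len(input_ids)  # single segment: all zeros

    # Pad every list up to max_seq_len; padding positions get mask 0.
    pad_len = max_seq_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad_len
    input_mask += [0] * pad_len
    segment_ids += [0] * pad_len
    return input_ids, input_mask, segment_ids


sentence = "The capital of India is Delhi"
# Assumption: a lowercase whitespace split stands in for the real tokenizer.
tokens = sentence.lower().split()

# Toy vocabulary built on the fly; real ids come from the model's vocab.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

ids, mask, seg = build_single_segment_features(tokens, vocab, max_seq_len=10)
print(ids)
print(mask)
print(seg)
```

With max_seq_len set to 10, the 6-word sentence takes 8 positions including the two markers, so the last two positions are padding. A real subword tokenizer may split words into several pieces, so actual lengths can differ from the word count.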

How do I get embeddings for ALBERT? I see that ALBERT also has a tokenization.py file, but I don't see a vocab.txt. I do see a file named 30k-clean.vocab. Can I use 30k-clean.vocab in place of vocab.txt?

Best answer

@user2543622, you can refer to the official code here. In your case you can do the following:

# hub.Module and tf.Session are TF1-style APIs
import tensorflow as tf
import tensorflow_hub as hub
albert_module = hub.Module("https://tfhub.dev/google/albert_base/2", trainable=True)
print(albert_module.get_signature_names()) # should output ['tokens', 'tokenization_info', 'mlm']
# then
tokenization_info = albert_module(signature="tokenization_info",
                                  as_dict=True)
with tf.Session() as sess:
  vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                        tokenization_info["do_lower_case"]])
print(vocab_file) # output b'/var/folders/v6/vnz79w0d2dn95fj0mtnqs27m0000gn/T/tfhub_modules/098d91f064a4f53dffc7633d00c3d8e87f3a4716/assets/30k-clean.model'

My guess is that this vocab_file is a binary SentencePiece model file, so you should use it for tokenization as shown below rather than using 30k-clean.vocab.

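# you still need the tokenization.py code to perform full tokenization
# (this snippet appears to come from inside a function in the official code,
# hence the original `return`; FLAGS.spm_model_file is defined by that script's flags)
tokenizer = tokenization.FullTokenizer(
  vocab_file=vocab_file, do_lower_case=do_lower_case,
  spm_model_file=FLAGS.spm_model_file)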

If you only need the values of the embedding matrix, take a look at albert_module.variable_map, for example:

print(albert_module.variable_map['bert/embeddings/word_embeddings'])
# <tf.Variable 'module/bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>
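Note that a word_embeddings matrix like this one is a static lookup table: row i holds the vector for token id i, independent of the sentence it appears in. The following is a minimal sketch of that lookup, assuming a tiny random table in place of the real (30000, 128) weights. The sizes, the embed helper, and the token ids are all made up for illustration.

```python
import random

# Toy sizes; the real table printed above has shape (30000, 128).
VOCAB_SIZE, EMBED_DIM = 8, 4

# Random values stand in for the trained weights (assumption).
rng = random.Random(0)
word_embeddings = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids, table):
    """Return one row of the table per token id (hypothetical helper)."""
    return [table[i] for i in token_ids]


# Two made-up id sequences that share token id 6.
sent_a = [2, 5, 6, 3]
sent_b = [2, 6, 7, 3]
emb_a = embed(sent_a, word_embeddings)
emb_b = embed(sent_b, word_embeddings)

# The shared token gets the same vector in both sentences,
# because the table lookup ignores context.
print(emb_a[2] == emb_b[1])
```

Context-dependent vectors come from running the full model on the token ids, not from this table alone.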