Problem Description
I have trained a fasttext model with Gensim over a corpus of very short sentences (up to 10 words). I know that my test set includes words that are not in my train corpus, i.e. some of the words in my corpus are like "Oxytocin", "Lexitocin", "Ematrophin", "Betaxitocin".
Given a new word in the test set, fasttext knows pretty well how to generate a vector with high cosine similarity to the similar words in the train set by using character-level n-grams.
How do I incorporate the fasttext model inside an LSTM Keras network without reducing it to just a list of vectors for the vocabulary? Because then I won't handle any OOV words, even though fasttext handles them well.

Any ideas?
Recommended Answer
Here is the procedure to incorporate the fasttext model inside an LSTM Keras network:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from gensim.models import FastText

# define dummy data and preprocess it
docs = ['Well done',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent',
        'Weak',
        'Poor effort',
        'not good',
        'poor work',
        'Could have done better']
docs = [d.lower().split() for d in docs]

# train fasttext from the gensim api
# (the vector_size parameter was called size in gensim < 4.0)
ft = FastText(vector_size=10, window=2, min_count=1, seed=33)
ft.build_vocab(docs)
ft.train(docs, total_examples=ft.corpus_count, epochs=10)
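Before wiring this into Keras, it is worth confirming the property the question relies on: fasttext composes vectors for unseen words from their character n-grams. A quick check (my addition, not part of the original answer; gensim >= 4 API assumed):

# 'worke' never occurs in docs, yet fasttext still returns a vector
print('worke' in ft.wv.key_to_index)      # False: not in the vocabulary
print(ft.wv['worke'].shape)               # (10,): synthesized from n-grams
print(ft.wv.similarity('work', 'worke'))  # typically high, shared n-grams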
# prepare text for keras neural network
max_len = 8
tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=True)
tokenizer.fit_on_texts(docs)
sequence_docs = tokenizer.texts_to_sequences(docs)
sequence_docs = tf.keras.preprocessing.sequence.pad_sequences(sequence_docs, maxlen=max_len)
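At this point sequence_docs is an integer matrix of shape (len(docs), max_len), with each document zero-padded to max_len:

print(sequence_docs.shape)  # (10, 8)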
# extract the learned fasttext embeddings and put them in a numpy array
embedding_matrix_ft = np.random.random((len(tokenizer.word_index) + 1, ft.vector_size))
pas = 0
for word, i in tokenizer.word_index.items():
    try:
        embedding_matrix_ft[i] = ft.wv[word]
    except KeyError:
        # fasttext builds vectors from n-grams, so this rarely triggers;
        # the row keeps its random initialization when it does
        pas += 1
# define a keras model and load the pretrained fasttext weight matrix
inp = Input(shape=(max_len,))
emb = Embedding(len(tokenizer.word_index) + 1, ft.vector_size,
                weights=[embedding_matrix_ft], trainable=False)(inp)
x = LSTM(32)(emb)
out = Dense(1)(x)
model = Model(inp, out)
model.predict(sequence_docs)
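The answer stops at predict with an untrained head. To actually fit the model, a minimal sketch (my addition; the labels are assumed, since the original answer defines none, reading the first five dummy docs as positive and the last five as negative):

# hypothetical binary labels matching the dummy docs above
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# Dense(1) has no activation, so train on logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
model.fit(sequence_docs, labels, epochs=5, verbose=0)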
How to deal with unseen words
unseen_docs = ['asdcs work','good nxsqa zajxa']
unseen_docs = [d.lower().split() for d in unseen_docs]
sequence_unseen_docs = tokenizer.texts_to_sequences(unseen_docs)
sequence_unseen_docs = tf.keras.preprocessing.sequence.pad_sequences(sequence_unseen_docs, maxlen=max_len)
model.predict(sequence_unseen_docs)
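One caveat worth flagging (my note, not part of the original answer): tokenizer.texts_to_sequences silently drops words it never saw in fit_on_texts, so 'asdcs', 'nxsqa' and 'zajxa' above never reach the network, and their fasttext n-gram vectors go unused. A sketch of one way to keep them is to bypass the Embedding layer and feed the fasttext vectors directly:

# look every word up in fasttext (OOV words get n-gram vectors), pad with zeros
def docs_to_vectors(word_docs, ft_model, max_len):
    out = np.zeros((len(word_docs), max_len, ft_model.vector_size))
    for i, words in enumerate(word_docs):
        for j, word in enumerate(words[:max_len]):
            out[i, j] = ft_model.wv[word]
    return out

vec_inp = Input(shape=(max_len, ft.vector_size))
vec_out = Dense(1)(LSTM(32)(vec_inp))
vec_model = Model(vec_inp, vec_out)
vec_model.predict(docs_to_vectors(unseen_docs, ft, max_len))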