Problem Description
I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing").
Right now, I have my text preprocessed using a standard tokenizer that splits on spaces / some punctuation, and then I have a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece tokenization is over standard tokenization + lemmatization. I know WordPiece helps with out-of-vocabulary words, but is there anything else? That is, even if I don't end up using BERT, should I consider replacing my tokenizer + lemmatizer with WordPiece tokenization? In what situations would that be useful?
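For context, here is a minimal sketch of the kind of pipeline described above; the choice of NLTK's word_tokenize and WordNetLemmatizer is an assumption for illustration, not something stated in the question:

```python
# Rough sketch of the current pipeline: whitespace/punctuation tokenization
# followed by lemmatization. NLTK is assumed here purely for illustration.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

text = "She was playing while the others played outside"
tokens = nltk.word_tokenize(text)                                    # split on spaces / punctuation
lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens]  # "playing" -> "play"
print(lemmas)
```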
Recommended Answer
WordPiece tokenization helps in multiple ways and should generally work better than a lemmatizer, for several reasons:
- If words like 'playful', 'playing', and 'played' are all lemmatized to 'play', you lose information, e.g. that 'playing' is present tense and 'played' is past tense; that information is preserved under WordPiece tokenization.
- WordPiece tokens cover every word, including words that do not occur in the vocabulary. An unseen word is split into sub-word pieces, so you still get embeddings for those pieces instead of dropping the word or replacing it with an 'unknown' token (see the sketch after this list).
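A minimal sketch of that behaviour, assuming the Hugging Face transformers BERT tokenizer (the answer itself does not name a library or model):

```python
# Sketch: how BERT's WordPiece tokenizer keeps inflected forms and splits
# out-of-vocabulary words into sub-word pieces instead of emitting [UNK].
# (bert-base-uncased and Hugging Face transformers are assumptions here.)
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["playing", "played", "playful", "covfefe"]:
    # In-vocabulary words stay whole; unknown words become pieces such as
    # 'co', '##v', '##fe', ... (the exact split depends on the vocabulary).
    print(word, "->", tokenizer.tokenize(word))
```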
Using WordPiece tokenization instead of a tokenizer + lemmatizer is largely a design choice, and WordPiece tokenization should perform well. But you may have to take the token count into account, because WordPiece tokenization increases the number of tokens, which is not the case with lemmatization.
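To make the token-count point concrete, a small comparison sketch (same assumed libraries as above):

```python
# Sketch of the token-count difference: WordPiece typically yields at least
# as many tokens as there are whitespace-separated words, since rare words
# are split into several pieces while common words stay whole.
import nltk
from transformers import BertTokenizer

nltk.download("punkt", quiet=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Pretokenized biomedical terminology often inflates subword counts"

word_tokens = nltk.word_tokenize(text)       # one token per word here
wordpiece_tokens = tokenizer.tokenize(text)  # rare words split into pieces

print(len(word_tokens), word_tokens)
print(len(wordpiece_tokens), wordpiece_tokens)
```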