Problem Description
I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing").
Right now, I have my text preprocessed using a standard tokenizer that splits on spaces / some punctuation, and then I have a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece tokenization is over standard tokenization + lemmatization. I know WordPiece helps with out-of-vocabulary words, but is there anything else? That is, even if I don't end up using BERT, should I consider replacing my tokenizer + lemmatizer with WordPiece tokenization? In what situations would that be useful?
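For context, here is a minimal sketch of the kind of pipeline described above; the choice of NLTK's word_tokenize and WordNetLemmatizer is an assumption for illustration, not something stated in the question:

```python
# Rough sketch of the current pipeline: whitespace/punctuation tokenization
# followed by lemmatization. NLTK is assumed here purely for illustration.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

text = "She was playing while the others played outside"
tokens = nltk.word_tokenize(text)                                    # split on spaces / punctuation
lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens]  # "playing" -> "play"
print(lemmas)
```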
Recommended Answer
WordPiece tokenization helps in multiple ways and should generally work better than a lemmatizer, for several reasons:
- If words like 'playful', 'playing', and 'played' are all lemmatized to 'play', you lose information, e.g. that 'playing' is present tense and 'played' is past tense; that information is preserved under WordPiece tokenization.
- WordPiece tokens cover every word, including words that do not occur in the vocabulary. An unseen word is split into sub-word pieces, so you still get embeddings for those pieces instead of dropping the word or replacing it with an 'unknown' token (see the sketch after this list).
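A minimal sketch of that behaviour, assuming the Hugging Face transformers BERT tokenizer (the answer itself does not name a library or model):

```python
# Sketch: how BERT's WordPiece tokenizer keeps inflected forms and splits
# out-of-vocabulary words into sub-word pieces instead of emitting [UNK].
# (bert-base-uncased and Hugging Face transformers are assumptions here.)
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["playing", "played", "playful", "covfefe"]:
    # In-vocabulary words stay whole; unknown words become pieces such as
    # 'co', '##v', '##fe', ... (the exact split depends on the vocabulary).
    print(word, "->", tokenizer.tokenize(word))
```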
Using WordPiece tokenization instead of a tokenizer + lemmatizer is largely a design choice, and WordPiece tokenization should perform well. But you may have to take the token count into account, because WordPiece tokenization increases the number of tokens, which is not the case with lemmatization.
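To make the token-count point concrete, a small comparison sketch (same assumed libraries as above):

```python
# Sketch of the token-count difference: WordPiece typically yields at least
# as many tokens as there are whitespace-separated words, since rare words
# are split into several pieces while common words stay whole.
import nltk
from transformers import BertTokenizer

nltk.download("punkt", quiet=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Pretokenized biomedical terminology often inflates subword counts"

word_tokens = nltk.word_tokenize(text)       # one token per word here
wordpiece_tokens = tokenizer.tokenize(text)  # rare words split into pieces

print(len(word_tokens), word_tokens)
print(len(wordpiece_tokens), wordpiece_tokens)
```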