Question
There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function that reverts the tokenized sentence to the original state? The function tokenize.untokenize() for some reason doesn't work.
I know that I can do, for example, the following, and it probably solves the problem, but I am curious whether there is an integrated function for this:
result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
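The replace-chain above can be wrapped into a small helper. This is a minimal, plain-Python sketch (the function name `untokenize` here is hypothetical, not part of nltk), and it only covers a few punctuation marks and simple clitics:

```python
def untokenize(tokens):
    """Naively join tokens, then re-attach punctuation and contractions.

    A sketch of the replace-chain approach above; it misses cases like
    quotes, parentheses, and "n't" contractions.
    """
    text = ' '.join(tokens)
    # Re-attach punctuation that word_tokenize split off.
    for sep in [',', '.', '!', '?', ';', ':']:
        text = text.replace(' ' + sep, sep)
    # Re-attach clitics like "'ve" or "'s" to the preceding word.
    text = text.replace(" '", "'")
    return text

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(untokenize(tokens))  # I've found a medicine for my disease.
```

This is fine for quick scripts, but it breaks down on anything involving quotation marks or nested punctuation, which is why a proper detokenizer is preferable.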
Answer
You can use the treebank detokenizer, TreebankWordDetokenizer:
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
There is also MosesDetokenizer, which used to be in nltk but was removed because of licensing issues; it is available as the standalone Sacremoses package.