Question
There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function that reverts the tokenized sentence to the original state? The function tokenize.untokenize() for some reason doesn't work.
I know that I can do, for example, the following, and it probably solves the problem, but I am curious whether there is an integrated function for this:
result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
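The replace-chain above can be wrapped into a small helper. This is a minimal, plain-Python sketch (the function name `untokenize` here is hypothetical, not part of nltk), and it only covers a few punctuation marks and simple clitics:

```python
def untokenize(tokens):
    """Naively join tokens, then re-attach punctuation and contractions.

    A sketch of the replace-chain approach above; it misses cases like
    quotes, parentheses, and "n't" contractions.
    """
    text = ' '.join(tokens)
    # Re-attach punctuation that word_tokenize split off.
    for sep in [',', '.', '!', '?', ';', ':']:
        text = text.replace(' ' + sep, sep)
    # Re-attach clitics like "'ve" or "'s" to the preceding word.
    text = text.replace(" '", "'")
    return text

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(untokenize(tokens))  # I've found a medicine for my disease.
```

This is fine for quick scripts, but it breaks down on anything involving quotation marks or nested punctuation, which is why a proper detokenizer is preferable.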
Answer
You can use the treebank detokenizer, TreebankWordDetokenizer:
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
There is also MosesDetokenizer, which used to be in nltk but was removed because of licensing issues; it is available as the standalone Sacremoses package.