The NLTK tokenizer and the Stanford CoreNLP tokenizer do not split at a period (.) that has no space after it



I have 2 sentences in my dataset:

w1 = "I am Pusheen the cat.I am so cute."   # no space after period
w2 = "I am Pusheen the cat. I am so cute."  # with space after period

When I use the NLTK tokenizers (both word and sentence), NLTK does not split cat.I into separate tokens.

Here is the word tokenizer output:

>>> nltk.word_tokenize(w1, 'english')
['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
>>> nltk.word_tokenize(w2, 'english')
['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and the sentence tokenizer output:

>>> nltk.sent_tokenize(w1, 'english')
['I am Pusheen the cat.I am so cute']
>>> nltk.sent_tokenize(w2, 'english')
['I am Pusheen the cat.', 'I am so cute']

How can I fix this? That is, how can I make NLTK tokenize w1 the same way as w2, given that words and punctuation are sometimes stuck together in my dataset?

Update: I tried Stanford CoreNLP 3.7.0, and it also does not split 'cat.I' into 'cat', '.', 'I'.

meow@meow-server:~/projects/stanfordcorenlp$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
I
am
Pusheen
the
cat.I
am
so
cute
.
PTBTokenizer tokenized 9 tokens at 111.21 tokens per second.

It's implemented this way on purpose: a period with no space after it usually doesn't signify the end of a sentence (think of the periods in "version 4.3", "i.e.", "A.M.", etc.). If your corpus often has sentence ends with no space after the full stop, you'll have to preprocess the text with a regular expression or something similar before sending it to NLTK.

A good rule of thumb is that a lowercase letter followed by a period followed by an uppercase letter usually signifies the end of a sentence. To insert a space after the period in such cases, you could use a regular expression, e.g.

import re
w1 = re.sub(r'([a-z])\.([A-Z])', r'\1. \2', w1)
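To check the heuristic before running it over a whole corpus, the substitution can be wrapped in a small helper and tried on a few strings. This is only a sketch: the function name add_space_after_period and the second test sentence are made up for illustration, and the pattern is the same lowercase-period-uppercase rule described above.

```python
import re

# Heuristic from the answer: lowercase letter, period, uppercase letter.
_MISSING_SPACE = re.compile(r'([a-z])\.([A-Z])')


def add_space_after_period(text):
    """Insert a space after a period that sits directly between
    a lowercase letter and an uppercase letter."""
    return _MISSING_SPACE.sub(r'\1. \2', text)


w1 = "I am Pusheen the cat.I am so cute."
print(add_space_after_period(w1))
# I am Pusheen the cat. I am so cute.

# Periods in version numbers and abbreviations do not match the pattern,
# so this (hypothetical) sentence is returned unchanged.
print(add_space_after_period("Use version 4.3, i.e. the A.M. build."))
# Use version 4.3, i.e. the A.M. build.
```

The preprocessed string can then be passed to nltk.word_tokenize or nltk.sent_tokenize as usual. Note that the rule will also split things that happen to match the pattern but are not sentence boundaries (for example an identifier such as obj.Name), so it is worth spot-checking the result on a sample of your own data.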

