Question
I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:
import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
'''
(Chapter 16)
"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = '"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'
print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.
-----
Hussey?
-----
"
'''
I don't expect perfection here, considering that Melville's syntax is a bit dated, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algo, however, I can't figure out how to tinker with it.
Anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.
Recommended answer
You need to supply a list of abbreviations to the tokenizer, like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)
sentences is now:
['is THAT what you mean, Mrs. Hussey?']
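Since the question asks for a simple heuristic that can be hacked, here is a rough sketch of how an abbreviation list keeps a splitter from breaking after "Mrs." This is not Punkt's algorithm, only a toy regex-based splitter; the names ABBREVIATIONS, BOUNDARY and naive_split are made up for illustration.

```python
import re

# Hypothetical abbreviation list: lowercase, no trailing period,
# in the same format as abbrev_types above.
ABBREVIATIONS = {'dr', 'vs', 'mr', 'mrs', 'prof', 'inc'}

# Candidate boundary: end punctuation, optional closing quotes, then whitespace.
BOUNDARY = re.compile(r'[.!?]["\']*\s+')

def naive_split(text, abbreviations=ABBREVIATIONS):
    """Toy heuristic sentence splitter (an assumption, not Punkt)."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Word immediately before the punctuation mark, without quotes.
        words = text[start:match.start()].split()
        last_word = words[-1].lower().strip('"\'') if words else ''
        # A period after a known abbreviation is not a sentence end.
        if text[match.start()] == '.' and last_word in abbreviations:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Whatever follows the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(naive_split("is THAT what you mean, Mrs. Hussey?"))
# ['is THAT what you mean, Mrs. Hussey?']
```

The sketch never looks at the word after the boundary, so it will still split a quotation from a lowercase continuation such as: Hussey?" says I. Checking whether the next word is capitalized would be the obvious next thing to hack in.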
Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty way around this is to put spaces in front of apostrophes and quotes that follow sentence-end symbols (.!?):
text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
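The chained replace calls only cover double quotes. A slightly more general version of the same workaround, sketched with a regular expression so that single quotes are handled too (pad_closing_quotes is a made-up helper name, not part of NLTK):

```python
import re

# End punctuation (. ! ?) immediately followed by a single or double quote.
_PUNCT_THEN_QUOTE = re.compile(r'([.!?])(["\'])')

def pad_closing_quotes(text):
    """Insert a space between end punctuation and a following quote mark."""
    return _PUNCT_THEN_QUOTE.sub(r'\1 \2', text)

print(pad_closing_quotes('Mrs. Hussey?" says I'))
# Mrs. Hussey? " says I
```

Run it on the text before passing it to sentence_splitter.tokenize(). Note that the inserted spaces stay in the resulting sentences, so they need to be removed afterwards if the original spacing matters.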