Problem description
I've been using NLTK's POS tagger:
...
nltk.pos_tag(nltk.word_tokenize(tfile[i]))
...
but sometimes I get inaccurate results (NN when I should be getting JJ, and so forth). The text I want to tag is within a fairly specific business domain (I'm not quite at liberty to say which domain here). Admittedly, I'm not an expert with either Python or NLTK (working on it, however), but I was wondering whether there is some way to improve the accuracy of the tagger.
I think I understand that the tagger works by comparing the text given to it to a corpus of pretagged text. My natural inclination is to try to add a set of my own self-tagged sentences to this corpus... but I don't know how to do this.
I'd greatly appreciate any advice on how to add my own text to the corpus (I'd prefer to add to an existing corpus rather than start a new one entirely). If anyone has other suggestions for improving the tagger's accuracy for my purposes, I'd love to hear them.
Thanks!
Recommended answer
You have probably already seen the GoogleCode book on nltk. I've been working through it very slowly on my own and while I have yet to tackle POS-tagging, it's one of the things I ultimately want to do when I feel adept enough to use the tool. At any rate, in Chapter 5, section 2 you get the following text and examples on making your own set of tagged tokens (apologies to all, but I copied directly from the text):
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
Continuing in section 5.2:
>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
That "sent" variable up above is actually what raw tagged text looks like, as confirmed by going to the nltk_data directory on my own computer and looking at anything in corpora/brown/, so you could write your own tagged text using this formatting and then build your own set of tagged tokens with it.
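To make that step concrete, here is a minimal sketch of turning your own hand-tagged text into tagged sentences. The sample text, the one-sentence-per-line layout, and the helper name `parse_tagged_text` are my own assumptions for illustration; only `nltk.tag.str2tuple` comes from the book excerpt above.

```python
import nltk

# Hypothetical hand-tagged domain text in the word/TAG format shown above.
# Assumption: one sentence per line.
raw_tagged = """
The/DT quarterly/JJ forecast/NN looks/VBZ solid/JJ ./.
We/PRP forecast/VBP steady/JJ growth/NN ./.
"""

def parse_tagged_text(text):
    """Turn lines of word/TAG tokens into a list of tagged sentences."""
    sentences = []
    for line in text.strip().splitlines():
        # str2tuple splits each token on its last '/' into (word, tag)
        tokens = [nltk.tag.str2tuple(tok) for tok in line.split()]
        if tokens:
            sentences.append(tokens)
    return sentences

my_tagged_sents = parse_tagged_text(raw_tagged)
print(my_tagged_sents[0])
```

Keeping the result as a list of sentences (rather than one flat list of tokens) matters, because that is the shape the NLTK taggers expect as training data.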
Once you have set up your own tagged tokens, you should then be able to set up your own unigram tagger based on them (from 5.5):
>>> # UnigramTagger expects a list of tagged sentences, each a list of (word, tag) tuples
>>> unigram_tagger = nltk.UnigramTagger(YOUR_OWN_TAGGED_TOKENS)
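As a rough sketch of what that looks like end to end, assuming a tiny invented training set (the sentences and tags below are made up for illustration, not taken from any corpus):

```python
import nltk

# Hypothetical hand-tagged sentences; a real training set would be far larger.
my_tagged_sents = [
    [("The", "DT"), ("quarterly", "JJ"), ("forecast", "NN"),
     ("looks", "VBZ"), ("solid", "JJ"), (".", ".")],
    [("Our", "PRP$"), ("forecast", "NN"), ("is", "VBZ"),
     ("solid", "JJ"), (".", ".")],
]

# Train a unigram tagger: each word gets the tag it most often had in training.
unigram_tagger = nltk.UnigramTagger(my_tagged_sents)

# Words never seen in training are tagged None when there is no backoff tagger.
result = unigram_tagger.tag(["The", "forecast", "is", "unclear", "."])
print(result)
```

The `None` for unseen words is the reason a backoff tagger is worth adding on top of a small training set.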
Finally, because your tagged text is likely to be a really small sample (and thus inaccurate on its own), you can specify a fallback tagger (NLTK calls it a backoff), so that when your tagger cannot tag a word, the fallback comes to the rescue:
>>> t0 = nltk.UnigramTagger(a_bigger_set_of_tagged_tokens)
>>> t1 = nltk.UnigramTagger(your_own_tagged_tokens, backoff=t0)
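Here is a small runnable sketch of that backoff chain. The two tiny training sets are invented stand-ins (in practice the general one might be something like the Brown corpus), and the `nltk.DefaultTagger("NN")` at the bottom of the chain is my own addition so that completely unknown words still get a tag:

```python
import nltk

# Stand-in for a large general-purpose tagged corpus (hypothetical data).
general_sents = [
    [("The", "DT"), ("premium", "NN"), ("was", "VBD"), ("paid", "VBN"), (".", ".")],
    [("The", "DT"), ("market", "NN"), ("was", "VBD"), ("open", "JJ"), (".", ".")],
]

# Hypothetical hand-tagged domain sentences, where "premium" is an adjective.
domain_sents = [
    [("The", "DT"), ("premium", "JJ"), ("plan", "NN"),
     ("was", "VBD"), ("popular", "JJ"), (".", ".")],
]

default = nltk.DefaultTagger("NN")                       # last resort
t0 = nltk.UnigramTagger(general_sents, backoff=default)  # general knowledge
t1 = nltk.UnigramTagger(domain_sents, backoff=t0)        # domain overrides

result = t1.tag(["The", "premium", "plan", "was", "open", "."])
print(result)
```

The domain tagger is consulted first, so its tag for "premium" should win over the general one, while "open", which only the general data knows, should still be handled by `t0`.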
Lastly, you should look into the differences between the n-gram taggers (unigram, bigram, and so on), which are also covered in the aforementioned Chapter 5.
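To illustrate the difference, here is a minimal sketch with invented training sentences: a unigram tagger looks only at the word itself, while a bigram tagger also looks at the previous tag, so it can tell the verb "plan" from the noun "plan".

```python
import nltk

# Hypothetical training data: "plan" is a noun twice and a verb once.
train_sents = [
    [("They", "PRP"), ("plan", "VBP"), ("a", "DT"), ("merger", "NN"), (".", ".")],
    [("The", "DT"), ("plan", "NN"), ("failed", "VBD"), (".", ".")],
    [("The", "DT"), ("plan", "NN"), ("worked", "VBD"), (".", ".")],
]

default = nltk.DefaultTagger("NN")
unigram = nltk.UnigramTagger(train_sents, backoff=default)
# The bigram tagger uses (previous tag, current word) as its context and
# falls back to the unigram tagger for contexts it has not learned.
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

verb_case = bigram.tag(["They", "plan", "a", "merger", "."])
noun_case = bigram.tag(["The", "plan", "failed", "."])
unigram_case = unigram.tag(["They", "plan", "a", "merger", "."])
print(verb_case)
print(noun_case)
print(unigram_case)
```

With this data the unigram tagger should always pick the majority tag for "plan", whereas the bigram tagger can pick the verb tag after a pronoun. Bigram contexts are sparse, which is why the backoff to the unigram tagger is there.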
At any rate, if you continue reading through Chapter 5, you'll see a few different ways of tagging text (including my favorite: the regex tagger!). There are a lot of ways to do this, and the topic is much too complex to cover adequately in a small post like this.
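For a flavor of the regex tagger, here is a minimal sketch using `nltk.RegexpTagger`. The patterns are my own rough guesses at useful suffix rules (for example, treating words ending in -able or -ous as adjectives); they are illustrative, not a tested rule set for any particular domain.

```python
import nltk

# Patterns are tried in order; the first one that matches a word wins.
patterns = [
    (r".*(able|ible|ous|ive)$", "JJ"),   # likely adjectives
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),    # cardinal numbers
    (r".*ing$", "VBG"),                  # gerunds
    (r".*ed$", "VBD"),                   # simple past
    (r".*s$", "NNS"),                    # plural nouns
    (r".*", "NN"),                       # everything else: noun
]

regexp_tagger = nltk.RegexpTagger(patterns)
result = regexp_tagger.tag(["refundable", "deposits", "42", "invoice"])
print(result)
```

A tagger like this can also be passed as the `backoff` of a unigram tagger, so that words missing from your training data get a suffix-based guess instead of a blanket default.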
Caveat emptor: I haven't tried all of this code, so I offer it as a solution I am currently, myself, trying to work out. If I have made errors, please help me correct them.