Question
I want to be able to get the POS tags of sentences one by one, like this:
def __remove_stop_words(self, tokenized_text, stop_words):
    sentences_pos = nltk.pos_tag(tokenized_text)
    filtered_words = [word for (word, pos) in sentences_pos
                      if pos not in stop_words and word not in stop_words]
    return filtered_words
But the problem is that pos_tag() takes about a second per sentence. There is another option: pos_tag_sents(), which tags batches of sentences and speeds things up. But my life would be easier if I could do this sentence by sentence.
Is there a way to do this faster?
Answer
For nltk version 3.1, inside nltk/tag/__init__.py, pos_tag is defined like this:
from nltk.tag.perceptron import PerceptronTagger

def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)
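To see why this pattern is slow, here is a minimal pure-Python sketch (no NLTK required; ExpensiveTagger and its counter are stand-ins of my own, not real NLTK names) contrasting per-call instantiation with a single reused instance:

```python
# ExpensiveTagger stands in for PerceptronTagger; __init__ counts how
# often it runs, standing in for the cost of loading a pickle file.
init_count = 0

class ExpensiveTagger:
    def __init__(self):
        global init_count
        init_count += 1                       # "pickle load" happens here

    def tag(self, tokens):
        return [(t, "NN") for t in tokens]    # dummy tags

def pos_tag_slow(tokens):
    return ExpensiveTagger().tag(tokens)      # pays the init cost every call

shared = ExpensiveTagger()                    # pay the cost once...

def pos_tag_fast(tokens):
    return shared.tag(tokens)                 # ...and reuse the instance

for _ in range(100):
    pos_tag_slow(["hello"])
    pos_tag_fast(["hello"])

print(init_count)  # 101: 100 from pos_tag_slow, 1 for the shared instance
```

The slow path repeats the expensive setup on every call, while the fast path pays it exactly once, which is the same trade-off the real pos_tag makes.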
So each call to pos_tag first instantiates PerceptronTagger, which takes some time because it involves loading a pickle file. _pos_tag simply calls tagger.tag when tagset is None. So you can save some time by loading the file once and calling tagger.tag yourself instead of calling pos_tag:
from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger()

def __remove_stop_words(self, tokenized_text, stop_words, tagger=tagger):
    sentences_pos = tagger.tag(tokenized_text)
    filtered_words = [word for (word, pos) in sentences_pos
                      if pos not in stop_words and word not in stop_words]
    return filtered_words
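If you would rather defer the expensive load until the tagger is first needed, instead of paying it at import time, a cached factory is another option. This is only a sketch: get_tagger is my own name, not an NLTK API, and the body uses a cheap stand-in so it runs without NLTK (in real code it would return PerceptronTagger()):

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=1)
def get_tagger():
    calls.append(1)          # in real code: return PerceptronTagger()
    return object()          # stand-in for the tagger instance

a = get_tagger()
b = get_tagger()
print(a is b, len(calls))    # True 1 -- built once, then cached
```

Every call after the first returns the same cached instance, so the pickle load would happen only once, and only if tagging is actually used.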
pos_tag_sents uses the same trick as above -- it instantiates PerceptronTagger once before calling _pos_tag many times. So you'll get a comparable gain in performance using the above code as you would by refactoring and calling pos_tag_sents.
Also, if stop_words is a long list, you may save a bit of time by making stop_words a set:
stop_words = set(stop_words)
since checking membership in a set (e.g. pos not in stop_words) is an O(1) (constant-time) operation, while checking membership in a list is an O(n) operation (i.e. it requires time that grows proportionally to the length of the list).
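A quick sanity check of that difference (pure Python; the list size and iteration count are illustrative):

```python
import timeit

stop_list = [f"w{i}" for i in range(10_000)]
stop_set = set(stop_list)

# Look up a word near the end of the list: the worst case for a list scan,
# but still a single hash lookup for the set.
t_list = timeit.timeit(lambda: "w9999" in stop_list, number=1000)
t_set = timeit.timeit(lambda: "w9999" in stop_set, number=1000)

print(t_list > t_set)  # the set lookup is far faster
```

Since __remove_stop_words checks membership twice per token, converting stop_words to a set once, before the loop over tokens, is a cheap win.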