Problem description
I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
Recommended answer
Since you didn't indicate whether you want word- or character-level n-grams, I'm just going to assume the former, without loss of generality.
I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.
def ngrams(tokens, MIN_N, MAX_N):
    """Yield every n-gram of tokens with MIN_N <= n <= MAX_N, as a slice of the token list."""
    n_tokens = len(tokens)
    for i in range(n_tokens):                                     # start position
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):  # end position (exclusive)
            yield tokens[i:j]
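For example, on a small made-up token list (the sentence below is just toy input), the generator produces every n-gram in the requested size range:

tokens = "the quick brown fox jumps".split()   # toy example input
for gram in ngrams(tokens, 2, 3):              # all bigrams and trigrams
    print(gram)   # ['the', 'quick'], ['the', 'quick', 'brown'], ['quick', 'brown'], ...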
Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
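As a plain-Python sketch of what that replacement might look like when counting n-grams (the counting variant and the name ngram_counts are just one illustrative choice):

from collections import defaultdict

def ngram_counts(tokens, MIN_N, MAX_N):
    # Same double loop as above, but the yield is replaced by a dict update,
    # so each space-joined n-gram is counted in place.
    counts = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            counts[" ".join(tokens[i:j])] += 1
    return counts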
Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:
from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    """Count space-joined n-grams of tokens with MIN_N <= n <= MAX_N."""
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)
    join_spaces = " ".join          # bind the method once, outside the loops

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1
    return count
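To compile it, one option is a minimal setup.py built on Cython's cythonize (assuming the code above is saved as ngrams_fast.pyx; the file name is just a placeholder):

# setup.py -- build in place with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("ngrams_fast.pyx"))

Once built, the extension imports like any other module, e.g. from ngrams_fast import ngrams.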