python - Python list of ngrams with frequencies

I need to get the most popular ngrams from a text. Ngram length must be from 1 to 5 words.
I know how to get bigrams and trigrams. For example:

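import nltk

# Assumed to be defined elsewhere: `words` (a list of tokens) and
# `filter_stops` (a function returning True for words to drop)
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
finder.apply_word_filter(filter_stops)
matches1 = finder.nbest(bigram_measures.pmi, 20)  # top 20 bigrams by PMI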

However, I found out that scikit-learn can get ngrams of various lengths. For example, I can get ngrams with lengths from 1 to 5.

v = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=5))

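But WordNGramAnalyzer is now deprecated. My question is: how can I get the N best word collocations from my text, with collocation lengths from 1 to 5? I also need a frequency list of these collocations/ngrams.
Can I do that with NLTK or scikit-learn? I need to get combinations of ngrams of various lengths from one text.
For example, when using NLTK bigrams and trigrams, there are many situations where my trigrams include my bigrams, or my trigrams are part of a bigger 4-gram. For example:
bigram: hello my
trigram: hello my name
I know how to exclude bigrams from trigrams, but I need a better solution.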

Best answer

Update
Since scikit-learn 0.14 the format has changed to:

n_grams = CountVectorizer(ngram_range=(1, 5))

Full example:

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

from sklearn.feature_extraction.text import CountVectorizer

c_vec = CountVectorizer(ngram_range=(1, 5))

# input to fit_transform() should be an iterable with strings
ngrams = c_vec.fit_transform([test_str1, test_str2])

# needs to happen after fit_transform()
vocab = c_vec.vocabulary_

count_values = ngrams.toarray().sum(axis=0)

# output n-grams
for ng_count, ng_text in sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True):
    print(ng_count, ng_text)

It outputs the following (note that the word I is dropped not because it is a stop word (it isn't) but because of its length: https://stackoverflow.com/a/20743758/):

> (3, u'to')
> (3, u'from')
> (2, u'ngrams')
> (2, u'need')
> (1, u'words')
> (1, u'trigrams but need better solutions')
> (1, u'trigrams but need better')
...
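
The dropped single-character tokens come from CountVectorizer's default token_pattern, (?u)\b\w\w+\b, which only matches tokens of two or more word characters. A minimal sketch using only the re module shows the effect on the first test string; tokens_default, tokens_all and dropped are illustrative names, and lowercasing by hand stands in for the vectorizer's default lowercase=True.

```python
import re

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."

# Default CountVectorizer pattern: tokens of two or more word characters
tokens_default = re.findall(r"(?u)\b\w\w+\b", test_str1.lower())

# Relaxed pattern: tokens of one or more word characters
tokens_all = re.findall(r"(?u)\b\w+\b", test_str1.lower())

# Tokens that only the relaxed pattern keeps
dropped = [t for t in tokens_all if t not in tokens_default]
print(dropped)
```

If those tokens matter, passing the relaxed pattern to the vectorizer, as in CountVectorizer(ngram_range=(1, 5), token_pattern=r"(?u)\b\w+\b"), should keep them.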

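This should/could be much simpler nowadays, IMO. You can try something like textacy, but that sometimes comes with its own complications, such as initializing a Doc, which currently does not work with v0.6.2 as shown in their docs. If Doc initialization worked as promised, the following would work in theory (but it doesn't):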

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

import textacy

# some version of the following line
doc = textacy.Doc([test_str1, test_str2])

ngrams = doc.to_bag_of_terms(ngrams={1, 5}, as_strings=True)
print(ngrams)
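
If neither library is convenient, the same kind of frequency list can be approximated with the standard library alone. The following is only a sketch: ngram_counts is a hypothetical helper, not part of scikit-learn or textacy, and it assumes tokenization similar to CountVectorizer's defaults (lowercasing, tokens of two or more word characters, n-grams allowed to cross sentence boundaries).

```python
import re
from collections import Counter

def ngram_counts(texts, min_n=1, max_n=5):
    """Count word n-grams of length min_n..max_n across all texts."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase and keep tokens of 2+ word characters
        tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
        for n in range(min_n, max_n + 1):
            # Slide a window of n tokens over the document
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

counts = ngram_counts([test_str1, test_str2])

# Print the five most frequent n-grams with their counts
for ngram, freq in counts.most_common(5):
    print(freq, ngram)
```

Counter.most_common(n) gives the n best n-grams by raw frequency, which is a simpler ranking than the PMI scoring used in the NLTK snippet from the question.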

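Old answer
WordNGramAnalyzer is indeed deprecated since scikit-learn 0.11. Creating n-grams and getting term frequencies are now combined in sklearn.feature_extraction.text.CountVectorizer. You can create all n-grams from 1 to 5 as follows: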

n_grams = CountVectorizer(min_n=1, max_n=5)

More examples and information can be found in the scikit-learn documentation for sklearn.feature_extraction.text.CountVectorizer.
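
The question also mentions shorter n-grams that are only fragments of longer ones, such as the bigram "hello my" inside the trigram "hello my name". One possible heuristic, which is an assumption here and not something either library provides, is to drop an n-gram when a longer n-gram with the same count contains it. drop_subsumed is a hypothetical helper and the counts below are made-up illustration data.

```python
def drop_subsumed(counts):
    """Drop n-grams that occur inside a longer n-gram with the same count."""
    kept = {}
    for ngram, freq in counts.items():
        # Pad with spaces so only whole-word matches count
        padded = f" {ngram} "
        subsumed = any(
            other != ngram and other_freq == freq and padded in f" {other} "
            for other, other_freq in counts.items()
        )
        if not subsumed:
            kept[ngram] = freq
    return kept

# Made-up counts for illustration
example_counts = {
    "hello": 3,
    "my": 2,
    "name": 2,
    "hello my": 2,
    "my name": 2,
    "hello my name": 2,
}

filtered = drop_subsumed(example_counts)
print(filtered)
```

Here "hello" survives because it occurs more often than any longer n-gram containing it, while the fragments of "hello my name" with the same count are removed. The comparison is quadratic in the number of n-grams, so it suits a filtered top-N list better than a full vocabulary.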

Regarding python - Python list of ngrams with frequencies, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11763613/