I have a list of unique ngrams (called ngramlist) and ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams that are equal to that element of ngramlist. I wrote the following code, which gives the correct output, but I am wondering if there is a way to optimize it:

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]


I imagine there is a function in nltk or elsewhere that does this faster, but I am not sure which one.

Thanks!

Edit: for what it's worth, the ngrams are generated as the joined output of nltk.util.ngrams, and ngramlist is just a list made from the set of all found ngrams.
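To make the input format concrete: nltk.util.ngrams(tokens, 3) yields tuples of three consecutive tokens, which the code above joins with spaces. A minimal pure-Python sketch of the same construction (using zip over shifted slices as a stand-in, so it runs without nltk):

```python
def trigram_strings(tokens):
    # Equivalent to [' '.join(t) for t in nltk.util.ngrams(tokens, 3)]:
    # zip over three shifted views yields each consecutive token triple.
    return [' '.join(t) for t in zip(tokens, tokens[1:], tokens[2:])]

tokens = ['New', 'York', 'City', 'is', 'big']
print(trigram_strings(tokens))
# ['New York City', 'York City is', 'City is big']
```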

Edit 2:

Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about):

from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd

articles = ['New York City','Moscow','Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()

data = {'article': [], 'treebank_tokenizer': []}
for article in articles:
    data['article'].append(wikipedia.page(article).content)
    data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))

df = pd.DataFrame(data)

df['ngrams-3'] = df['treebank_tokenizer'].map(
    lambda x: [' '.join(t) for t in ngrams(x, 3)])

ngramlist = list(set([trigram for sublist in df['ngrams-3'].tolist() for trigram in sublist]))

df['freqlist'] = df['ngrams-3'].map(
    lambda ngrams_: [sum(int(ngram == ngram_candidate)
                         for ngram_candidate in ngrams_) / len(ngrams_)
                     for ngram in ngramlist])

Best answer

You can optimize this a bit by precomputing some quantities and using a Counter. This will be especially useful if most of the elements of ngramlist are contained in ngrams.

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]


You don't, of course, have to scan all of ngrams every time you check an element of ngramlist. A single counting pass over ngrams makes this algorithm O(n) instead of the O(n²) one you have now. Remember that shorter code is not necessarily better or more efficient code:

from collections import Counter
...

counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
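A toy comparison of the two approaches on made-up trigram data (the lists here are illustrative, not from the question), showing they produce identical results:

```python
from collections import Counter

ngrams_ = ['a b', 'b c', 'a b', 'c d']  # tokenized ngrams (toy data)
ngramlist = ['a b', 'b c', 'x y']       # unique ngrams of interest

# Original approach: rescans all of ngrams_ for every element of ngramlist.
slow = [sum(int(ngram == cand) for cand in ngrams_) / len(ngrams_)
        for ngram in ngramlist]

# Counter approach: one counting pass, then O(1) lookups per element.
counter = Counter(ngrams_)
size = len(ngrams_)
fast = [counter.get(ngram, 0) / size for ngram in ngramlist]

print(slow)  # [0.5, 0.25, 0.0]
print(fast)  # [0.5, 0.25, 0.0]
```

counter.get(ngram, 0) returns 0 for ngrams that never occur, which is why the missing 'x y' correctly gets frequency 0.0.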


To use this properly with map, you have to write a def function instead of a lambda:

def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
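For a quick check without the wikipedia download, the same mapping can be exercised on a toy DataFrame (the trigram strings and ngramlist below are invented for illustration):

```python
from collections import Counter
import pandas as pd

ngramlist = ['a b c', 'b c d', 'x y z']  # toy global vocabulary

def count_ngrams(ngrams_):
    # One counting pass per row, then constant-time lookups.
    counter = Counter(ngrams_)
    size = len(ngrams_)
    return [counter.get(ngram, 0) / size for ngram in ngramlist]

df = pd.DataFrame({'ngrams-3': [['a b c', 'b c d'],
                                ['a b c', 'x y z', 'a b c']]})
df['freqlist'] = df['ngrams-3'].map(count_ngrams)
print(df['freqlist'].tolist())
# [[0.5, 0.5, 0.0], [0.666..., 0.0, 0.333...]]
```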
