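This is what my sample dataset looks like:
My goal is to find the number of impressions associated with each one-word, two-word, three-word, four-word, five-word, and six-word phrase. I ran an n-gram algorithm before, but it only returns counts. Here is my current n-gram code.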
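import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def find_ngrams(text, n):
    # Count how often each n-gram of exactly n words occurs across all queries.
    word_vectorizer = CountVectorizer(ngram_range=(n, n), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(text)
    # Sum the sparse matrix column-wise instead of using the built-in sum().
    frequencies = np.asarray(sparse_matrix.sum(axis=0)).ravel()
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    ngram = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
    ngram = ngram.sort_values(by='frequency', ascending=False)
    return ngram
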
one = find_ngrams(df['query'],1)
bi = find_ngrams(df['query'],2)
tri = find_ngrams(df['query'],3)
quad = find_ngrams(df['query'],4)
pent = find_ngrams(df['query'],5)
hexx = find_ngrams(df['query'],6)
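I think what I need to do is: 1. Split each query into phrases of one to six words. 2. Attach the query's impressions to each split phrase. 3. Regroup all the split phrases and sum their impressions.
Take the second query, "dog common diseases and how to treat them", as an example. It should be split into: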
(1) 1-gram: dog, common, diseases, and, how, to, treat, them;
(2) 2-gram: dog common, common diseases, diseases and, and how, how to, to treat, treat them;
(3) 3-gram: dog common diseases, common diseases and, diseases and how, and how to, how to treat, to treat them;
(4) 4-gram: dog common diseases and, common diseases and how, diseases and how to, and how to treat, how to treat them;
(5) 5-gram: dog common diseases and how, common diseases and how to, diseases and how to treat, and how to treat them;
(6) 6-gram: dog common diseases and how to, common diseases and how to treat, diseases and how to treat them;
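The split described above can be sketched in plain Python. This is only an illustration of step 1: the helper name ngrams_of and its max_n parameter are hypothetical, and the query is assumed to be tokenized on whitespace.

```python
def ngrams_of(query, max_n=6):
    """Return {n: list of n-word phrases} for n = 1..max_n (hypothetical helper)."""
    words = query.split()
    return {
        n: [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        for n in range(1, max_n + 1)
    }

grams = ngrams_of("dog common diseases and how to treat them")
for n, items in grams.items():
    print(f"{n}-gram:", ", ".join(items))
```

For an eight-word query this yields 8, 7, 6, 5, 4, and 3 phrases for n = 1 through 6, matching the lists above.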
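Best answer
Here is one way to do it. It is not the most efficient, but let's not optimize prematurely. The idea is to use apply to get a new pd.DataFrame with one column per n-gram, join it with the original DataFrame, and then do some stacking and grouping.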
import pandas as pd

df = pd.DataFrame({
    "squery": ["how to feed a dog", "dog habits", "to cat or not to cat", "dog owners"],
    "count": [1000, 200, 100, 150]
})

def n_grams(txt):
    grams = list()
    words = txt.split(' ')
    for i in range(len(words)):
        for k in range(1, len(words) - i + 1):
            grams.append(" ".join(words[i:i+k]))
    return pd.Series(grams)

counts = df.squery.apply(n_grams).join(df)
counts.drop("squery", axis=1).set_index("count").unstack()\
    .rename("ngram").dropna().reset_index()\
    .drop("level_0", axis=1).groupby("ngram")["count"].sum()
The last expression returns a pd.Series like the following.
ngram
a 1000
a dog 1000
cat 200
cat or 100
cat or not 100
cat or not to 100
cat or not to cat 100
dog 1350
dog habits 200
dog owners 150
feed 1000
feed a 1000
feed a dog 1000
habits 200
how 1000
how to 1000
how to feed 1000
how to feed a 1000
how to feed a dog 1000
not 100
not to 100
not to cat 100
or 100
or not 100
or not to 100
or not to cat 100
owners 150
to 1200
to cat 200
to cat or 100
to cat or not 100
to cat or not to 100
to cat or not to cat 100
to feed 1000
to feed a 1000
to feed a dog 1000
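As a possible alternative to the unstack chain, the same totals can be sketched with DataFrame.explode (available since pandas 0.25). This is not the answer's code: the helper n_grams_upto and its max_n cap of six words are assumptions added to match the limit in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "squery": ["how to feed a dog", "dog habits", "to cat or not to cat", "dog owners"],
    "count": [1000, 200, 100, 150],
})

def n_grams_upto(txt, max_n=6):
    """List every n-gram of 1..max_n words in txt (hypothetical helper)."""
    words = txt.split()
    return [
        " ".join(words[i:i + k])
        for i in range(len(words))
        for k in range(1, min(max_n, len(words) - i) + 1)
    ]

# One row per (query, n-gram) pair, then sum the impressions per n-gram.
totals = (
    df.assign(ngram=df["squery"].apply(n_grams_upto))
      .explode("ngram")
      .groupby("ngram")["count"]
      .sum()
)
print(totals)
```

As in the output above, an n-gram that occurs twice in one query (for example "cat" or "to cat") contributes that query's impressions twice. Deduplicating the list inside the helper would count each query only once.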
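A sneakier approach
This one is probably somewhat more efficient, but it still materializes the dense n-gram matrix from CountVectorizer. It multiplies each column of that matrix by the impression counts and then sums each column to get the total impressions per n-gram. It gives the same kind of result as above, with two caveats: CountVectorizer's default token pattern drops single-character tokens, so "how to feed a dog" is tokenized as "how to feed dog", and ngram_range=(1, 5) omits the six-word n-gram. Note also that a query containing a repeated n-gram counts double here as well.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 5))
ngrams = cv.fit_transform(df.squery)
# Repeat the impression counts across one column per n-gram in the vocabulary.
mask = np.repeat(df['count'].values.reshape(-1, 1), repeats=len(cv.vocabulary_), axis=1)
# Order the n-grams by their column index in the count matrix.
index = list(map(lambda x: x[0], sorted(cv.vocabulary_.items(), key=lambda x: x[1])))
pd.Series(np.multiply(mask, ngrams.toarray()).sum(axis=0), name="counts", index=index)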
Regarding "python - N-gram analysis based on impressions in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43528296/