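I made the silly mistake of not pickling my CountVectorizer. Instead, I saved a list of all the n-grams it produced, about 3,500 features.
Now I need to load a CountVectorizer model from this list of n-grams. Is there any way to do that? The list currently lives in a pd.DataFrame.
I was hoping I could do something like
CV = CountVectorizer("loadMyListofnGrams")
Any help would be greatly appreciated!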
Best answer
You can achieve this by fitting a CountVectorizer on your list of n-grams.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

ngrams = ['coffee', 'darkly', 'darkly colored', 'bitter', 'stimulating',
          'drinks', 'stimulating drinks']

new_docs = [
    'Coffee is darkly colored, bitter, slightly acidic and '
    'has a stimulating effect in humans, primarily due to its '
    'caffeine content.[3] ',
    'It is one of the most popular drinks '
    'in the world,[4] and it can be prepared and presented in a '
    'variety of ways (e.g., espresso, French press, caffè latte). '
]

# Instantiate CountVectorizer and fit it on your n-grams
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(ngrams)
cv.vocabulary_

# Apply the vectorizer to new documents and display the dense matrix
counts = cv.transform(new_docs)
counts.toarray()

# Turn the results into a data frame
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
counts_df = pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out())
counts_df
Output
cv.vocabulary_
Out[10]:
{'coffee': 1,
'darkly': 3,
'colored': 2,
'darkly colored': 4,
'bitter': 0,
'stimulating': 6,
'drinks': 5,
'stimulating drinks': 7}
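Note that `colored` appears in the fitted vocabulary even though it is not in the original list: fitting re-tokenizes each entry, so the bigram `darkly colored` also contributes its unigrams. A minimal sketch of a check for such extra features, reusing the same `ngrams` list (the name `extra` is only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

ngrams = ['coffee', 'darkly', 'darkly colored', 'bitter', 'stimulating',
          'drinks', 'stimulating drinks']

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(ngrams)

# Features that fitting introduced but that were not in the saved list
extra = sorted(set(cv.vocabulary_) - set(ngrams))
print(extra)
```

If this list is not empty, the resulting count matrix has more columns than the original vectorizer produced.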
counts_df
Out[12]:
   bitter  coffee  colored  darkly  darkly colored  drinks  stimulating  stimulating drinks
0       1       1        1       1               1       0            1                   0
1       0       0        0       0               0       1            0                   0
Regarding "python - How to load a CountVectorizer from a list of n-grams?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59282495/