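I made the silly mistake of not pickling my CountVectorizer. Instead, I saved a list of all the n-grams it produced, roughly 3,500 features.

My problem now is that I need to rebuild the CountVectorizer model from this list of n-grams. Is there any way to do that? The list currently lives in a pd.DataFrame.

I was hoping I could do something like

CV = CountVectorizer("loadMyListofnGrams")

Any help would be greatly appreciated!

Best answer

You can do this by fitting a CountVectorizer on your list of n-grams.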

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

ngrams = ['coffee', 'darkly', 'darkly colored', 'bitter', 'stimulating',
          'drinks', 'stimulating drinks']

new_docs = [
    'Coffee is darkly colored, bitter, slightly acidic and '
    'has a stimulating effect in humans, primarily due to its '
    'caffeine content.[3] ',
    'It is one of the most popular drinks '
    'in the world,[4] and it can be prepared and presented in a '
    'variety of ways (e.g., espresso, French press, caffè latte). ',
]

# Instantiate CountVectorizer and train it with your ngrams
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(ngrams)
cv.vocabulary_

# Apply the vectorizer to new documents and display the dense matrix
counts = cv.transform(new_docs)
counts.toarray()

# Turn the results into a data frame
# (get_feature_names_out() requires scikit-learn >= 1.0)
counts_df = pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out())
counts_df


Output

cv.vocabulary_
Out[10]:
{'coffee': 1,
 'darkly': 3,
 'colored': 2,
 'darkly colored': 4,
 'bitter': 0,
 'stimulating': 6,
 'drinks': 5,
 'stimulating drinks': 7}

counts_df
Out[12]:
   bitter  coffee  colored  darkly  darkly colored  drinks  stimulating  \
0       1       1        1       1               1       0            1
1       0       0        0       0               0       1            0

   stimulating drinks
0                   0
1                   0

Regarding python - How to load a CountVectorizer from a list of n-grams?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59282495/
