How to remove stop phrases / stop ngrams (multi-word strings) using pandas/sklearn?

Question

I want to prevent certain phrases from creeping into my models. For example, I want to prevent 'red roses' from entering my analysis. I understand how to add individual stop words, as described in "Adding words to scikit-learn's CountVectorizer's stop list", by doing:

from sklearn.feature_extraction import text
additional_stop_words=['red','roses']

However, this also results in other ngrams like 'red tulips' or 'blue roses' not being detected.
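The problem can be seen without scikit-learn at all. A minimal sketch (using naive whitespace tokenization, not sklearn's tokenizer) shows that removing the tokens 'red' and 'roses' individually also mangles phrases we want to keep:

```python
# Token-level stop words are too blunt: dropping 'red' and 'roses'
# individually also strips them out of 'red tulips' and 'blue roses'.
stop_words = {"red", "roses"}
docs = ["red roses", "red tulips", "blue roses"]

filtered = [" ".join(w for w in d.split() if w not in stop_words)
            for d in docs]
print(filtered)  # ['', 'tulips', 'blue']
```

Only the full phrase 'red roses' should have disappeared; instead every document lost tokens.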

I am building a TfidfVectorizer as part of my model, and I realize the processing I need might have to be entered after this stage but I am not sure how to do this.

My eventual aim is to do topic modelling on a piece of text. Here is the piece of code (borrowed almost directly from https://de.dariah.eu/tatom/topic_model_python.html#index-0 ) that I am working on:

import numpy as np
from sklearn import decomposition
from sklearn.feature_extraction import text

additional_stop_words = ['red', 'roses']

sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    stop_words=sw,
    norm='l2',
    min_df=5
)

dtm = mod_vectorizer.fit_transform(df[col]).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
    n_components=num_topics,  # named n_topics in scikit-learn < 0.19
    random_state=1
)

doctopic = m_clf.fit_transform(dtm)
topic_words = []

for topic in m_clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))
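The top-words selection above hinges on one NumPy idiom: `np.argsort` sorts ascending, so `[::-1]` flips it to descending before slicing off the top indices. A standalone illustration:

```python
import numpy as np

# toy topic-word weight vector
topic = np.array([0.1, 0.5, 0.2, 0.9, 0.3])

# argsort gives indices in ascending order; [::-1] reverses to descending
top2_idx = np.argsort(topic)[::-1][:2]
print(top2_idx)  # indices of the two largest weights: [3 1]
```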

EDIT

Sample dataframe (I have tried to insert as many edge cases as possible), df:

   Content
0  I like red roses as much as I like blue tulips.
1  It would be quite unusual to see red tulips, but not RED ROSES
2  It is almost impossible to find blue roses
3  I like most red flowers, but roses are my favorite.
4  Could you buy me some red roses?
5  John loves the color red. Roses are Mary's favorite flowers.
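For reference, the sample above can be reconstructed as a pandas DataFrame like this (column name `Content` taken from the table header; the rest is a straightforward transcription):

```python
import pandas as pd

df = pd.DataFrame({"Content": [
    "I like red roses as much as I like blue tulips.",
    "It would be quite unusual to see red tulips, but not RED ROSES",
    "It is almost impossible to find blue roses",
    "I like most red flowers, but roses are my favorite.",
    "Could you buy me some red roses?",
    "John loves the color red. Roses are Mary's favorite flowers.",
]})
```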

Answer

TfidfVectorizer allows for a custom preprocessor. You can use this to make any needed adjustments.

For example, to remove all occurrences of consecutive "red" + "roses" tokens from your example corpus (case-insensitive), use:

import re
import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex considers "... red. Roses..." as fair game for removal.
    #       if that's not what you want, just use ["red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # define our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn

Now vocab has all red roses references removed.
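To see the sentence-boundary case from the code comment in isolation, here is the same preprocessor run by itself on the trickiest sample row (". Roses" after "red" is swallowed by the optional `\.?\s?` group):

```python
import re

def remove_stop_phrases(doc):
    # the answer's pattern as a raw string; optionally eats ". " between the words
    for phrase in [r"red(\s?\.?\s?)roses"]:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

out = remove_stop_phrases("John loves the color red. Roses are Mary's favorite flowers.")
print(out)  # 'red. Roses' is gone, sentence boundary and all
```

If you do not want cross-sentence matches, the comment's suggestion applies: use the plain pattern `"red roses"` instead.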

print(sorted(vocab))

['Could buy',
 'It impossible',
 'It impossible blue',
 'It quite',
 'It quite unusual',
 'John loves',
 'John loves color',
 'Mary favorite',
 'Mary favorite flowers',
 'blue roses',
 'blue tulips',
 'color Mary',
 'color Mary favorite',
 'favorite flowers',
 'flowers roses',
 'flowers roses favorite',
 'impossible blue',
 'impossible blue roses',
 'like blue',
 'like blue tulips',
 'like like',
 'like like blue',
 'like red',
 'like red flowers',
 'loves color',
 'loves color Mary',
 'quite unusual',
 'quite unusual red',
 'red flowers',
 'red flowers roses',
 'red tulips',
 'roses favorite',
 'unusual red',
 'unusual red tulips']

UPDATE (per comment thread):

To pass in desired stop phrases along with custom stop words to a wrapper function, use:

desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2,3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn

    return vocab

wrapper(desired_stop_words, desired_stop_phrases)
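One caveat with the wrapper: the stop phrases are treated as regular expressions, so a phrase containing `.`, `?`, or `(` will not match literally. A small hypothetical helper (the `make_phrase_remover` name and `literal` flag are my own, not from the answer) escapes plain-text phrases with `re.escape` before substitution:

```python
import re

def make_phrase_remover(phrases, literal=True):
    # escape phrases so '.', '?', '(' etc. are matched literally;
    # pass literal=False to supply hand-written regex patterns instead
    patterns = [re.escape(p) if literal else p for p in phrases]

    def remove(doc):
        for pat in patterns:
            doc = re.sub(pat, "", doc, flags=re.IGNORECASE)
        return doc

    return remove

remove = make_phrase_remover(["red roses"])
print(remove("not RED ROSES today"))  # 'not  today'
```

The returned closure can be passed directly as `preprocessor=` to `TfidfVectorizer`, exactly like `remove_stop_phrases` above.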
