python - Removing stop words using a large stop word list

I am doing some NLP on a dataset and trying to remove stop words.

Instead of nltk's built-in stop words, I am using a custom stop word list (about 10,000 words in different languages).

I first defined the following function:

def clean_text(text):
    text = ''.join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [lm.lemmatize(word) for word in tokens if word not in stopwords]
    return text

Then I applied it to the data frame as follows:

df_train['clean_text'] = df_train['question_text'].apply(lambda x: clean_text(x))

My problem is that processing takes a very long time. Is there a faster way to do this?

Best answer

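Membership checks (x in data_structure) on strings and lists are linear. This means string.punctuation is scanned for every character in the initial text, and stopwords is scanned for every token. Turn both into sets to make these checks constant-time on average: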

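import re
import string

# Build the sets once, outside the function, so each lookup is O(1) on average.
# `stopwords` (the custom list) and `lm` (the lemmatizer) are defined as in the question.
punct = set(string.punctuation)
stopwords = set(stopwords)

def clean_text(text):
    # Lowercase the text and drop punctuation, character by character.
    text = ''.join(char.lower() for char in text if char not in punct)
    tokens = re.split(r'\W+', text)
    # Lemmatize every token that is not a stop word.
    text = [lm.lemmatize(word) for word in tokens if word not in stopwords]
    return text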
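
To see the difference for yourself, here is a rough timing sketch rather than a proper benchmark. It uses only the standard library. The randomly generated stop word list is a hypothetical stand-in for the roughly 10,000-word custom list from the question, and `filter_tokens` is an illustrative helper that covers only the stop word check, not the lemmatization step.

```python
import random
import string
import timeit

random.seed(0)

# Hypothetical stand-in for a ~10,000-word custom stop word list (6-letter words).
stopword_list = ["".join(random.choices(string.ascii_lowercase, k=6))
                 for _ in range(10_000)]
stopword_set = set(stopword_list)

# 7-letter tokens can never match a 6-letter stop word, so every lookup
# against the list has to scan it from start to end (the worst case).
tokens = ["".join(random.choices(string.ascii_lowercase, k=7))
          for _ in range(200)]


def filter_tokens(tokens, stopwords):
    """Keep only the tokens that are not in the stop word collection."""
    return [tok for tok in tokens if tok not in stopwords]


# Both containers produce the same result; only the lookup cost differs.
assert filter_tokens(tokens, stopword_list) == filter_tokens(tokens, stopword_set)

list_time = timeit.timeit(lambda: filter_tokens(tokens, stopword_list), number=3)
set_time = timeit.timeit(lambda: filter_tokens(tokens, stopword_set), number=3)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The exact numbers depend on the machine, but the list version should get slower as the stop word list grows, while the set version should stay roughly flat.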

Some references:

https://wiki.python.org/moin/TimeComplexity#set
https://wiki.python.org/moin/TimeComplexity#list

Regarding python - removing stop words using a large stop word list, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53935663/