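My problem is that my code cannot match individual words/tokens against the stop words in order to remove them from the original text. Instead, I get whole sentences, which can never match a stop word. Please show me a way to get individual tokens, match them against the stop words, and remove the matches. Please help.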
from nltk.corpus import stopwords
import string, os

def remove_stopwords(ifile):
    processed_word_list = []
    stopword = stopwords.words("urdu")
    text = open(ifile, 'r').readlines()
    for word in text:
        print(word)
        if word not in stopword:
            processed_word_list.append('*')
    print(processed_word_list)
    return processed_word_list

if __name__ == "__main__":
    print("Input file path: ")
    ifile = input()
    remove_stopwords(ifile)
Best answer
Try the following:
from nltk.corpus import stopwords
import ast

def remove_stopwords(ifile):
    processed_word_list = []
    # Requires a stop-word list named "urdu" in the local NLTK stopwords corpus.
    stopword = stopwords.words("urdu")
    # Assumes the file holds a Python list literal of tokens, e.g. ['w1', 'w2'].
    with open(ifile, 'r', encoding='utf-8') as f:
        words = ast.literal_eval(f.read())
    for word in words:
        print(word)
        if word in stopword:
            # Mask each stop word; omit this append to drop it entirely.
            processed_word_list.append('*')
        else:
            processed_word_list.append(word)
    print(processed_word_list)
    return processed_word_list

if __name__ == "__main__":
    print("Input file path: ")
    ifile = input()
    remove_stopwords(ifile)
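If the input is ordinary running text rather than a Python list literal, the same idea can be sketched without ast: iterate over the lines, split each line into whitespace-separated tokens, and keep only the tokens that are not stop words. This is a minimal sketch, not the answer's own method; the names strip_stopwords and remove_stopwords_from_file are hypothetical, and it assumes that plain whitespace splitting is an acceptable tokenizer for the text (punctuation stays attached to words).

```python
def strip_stopwords(lines, stopword_list):
    """Return the tokens found in `lines`, in order, with stop words dropped.

    `lines` is any iterable of strings (for example an open file object).
    Tokenization is plain whitespace splitting, which is an assumption of
    this sketch; punctuation is not separated from words.
    """
    stopword_set = set(stopword_list)  # a set makes each membership test cheap
    return [
        token
        for line in lines           # each `line` is a whole line of text
        for token in line.split()   # split() breaks the line into single tokens
        if token not in stopword_set
    ]


def remove_stopwords_from_file(path, stopword_list, encoding="utf-8"):
    """Hypothetical helper: read a text file and strip stop words from it."""
    with open(path, encoding=encoding) as handle:
        return strip_stopwords(handle, stopword_list)
```

With NLTK available, the call could look like remove_stopwords_from_file(ifile, stopwords.words("urdu")), provided an "urdu" list exists in the local stopwords corpus. Passing the stop words in as an argument keeps the filtering step independent of where the list comes from.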
Regarding "python - How to read tokens from a file one by one in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45617523/