
Problem Description


I'm trying to filter my dataset, which contains nearly 50K articles. From each article I want to filter out stop words and punctuation. But the process is taking a long time. I've already filtered the dataset, and it took 6 hours. Now I've got another dataset to filter which contains 300K articles.


I'm using Python in an Anaconda environment. PC configuration: 7th Gen Core i5, 8 GB RAM and an NVIDIA 940MX GPU. To filter my dataset I've written code which takes each article in the dataset, tokenizes the words and then removes stop words, punctuation and numbers.

def sentence_to_wordlist(sentence, filters="!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n?,।!.'0123456789০১২৩৪৫৬৭৮৯‘\u200c–“”…‘"):
    # Map every filtered character to a space, then split on whitespace
    translate_dict = dict((c, ' ') for c in filters)
    translate_map = str.maketrans(translate_dict)
    wordlist = sentence.translate(translate_map).split()
    # stops: the stop-word collection, defined elsewhere in the original post
    return list(filter(lambda x: x not in stops, wordlist))
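
The post doesn't show how this function is applied to the whole dataset, but the reported runtime suggests a plain sequential pass, roughly like the sketch below. The names articles and stops are assumptions for illustration, not from the original code, and the stop words are deliberately kept in a list, which is one of the things the answer below changes.

from nltk.corpus import stopwords

# Assumed setup for the sketch: stop words in a plain list (O(n) lookups)
# and articles processed one by one in a single process.
stops = stopwords.words("english")
articles = ["This is the first article ...", "And a second one ..."]  # stand-in for the real ~50K articles

filtered = [sentence_to_wordlist(article) for article in articles]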


Now I want to reduce the time for this process. Is there any way to optimize this?

Recommended Answer


I've been trying to optimize your process:

from nltk.corpus import stopwords

cachedStopWords = set(stopwords.words("english"))  # a set, so each lookup is O(1) on average

filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n?,।!.'0123456789০১২৩৪৫৬৭৮৯‘\u200c–“”…‘"
# Build the translation table once, instead of rebuilding it on every call
translate_table = str.maketrans('', '', filters)

def sentence_to_wordlist(sentence):
    wordlist = sentence.translate(translate_table).split()
    return [w for w in wordlist if w not in cachedStopWords]

from multiprocessing import Pool

if __name__ == '__main__':  # guard needed so worker processes don't re-run this block
    p = Pool(10)
    results = p.map(sentence_to_wordlist, data)  # data: a list with your articles (see notes below)


  • data is a list with your articles


  • I've been using the stop words from nltk, but you can use your own; just make sure your stop words are stored in a set, not a list (checking whether an element is in a set is O(1) on average, while for a list it is O(n)). A small timing sketch illustrating this follows the list below.


  • I've been testing with a list of 100k articles, each article having around 2k characters; it took me less than 9 seconds.
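
As a rough illustration of the set-versus-list point above, here is a minimal timing sketch. It is not part of the original answer: the token list and its size are made up for the example, and it assumes the nltk stopwords corpus is already downloaded.

import timeit
from nltk.corpus import stopwords

stop_list = stopwords.words("english")   # plain list: each membership test is O(n)
stop_set = set(stop_list)                # set: each membership test is O(1) on average

# An artificial stream of 100k tokens, just for the comparison
words = ["running", "the", "quick", "fox"] * 25000

t_list = timeit.timeit(lambda: [w for w in words if w not in stop_list], number=1)
t_set = timeit.timeit(lambda: [w for w in words if w not in stop_set], number=1)
print(f"list: {t_list:.4f}s   set: {t_set:.4f}s")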

