This article looks at tracking word frequency per file with a list of dictionaries. The question and answer below should be a useful reference for anyone solving a similar problem.
Problem Description
I have written some code to count word frequency in multiple text files and store them in a dictionary.
I have been trying to find a method to keep a running total per file of counts for each word in a form something like:
word1 [1] [20] [30] [22]
word2 [5] [7] [0] [4]
I have tried using counters but I've not been able to find an appropriate method/data structure for this yet.
import string
from collections import defaultdict
from collections import Counter
import glob
import os
# Words to remove
noise_words_set = {'the','to','of','a','in','is',...etc...}
# Find files
path = r"C:\Users\Logs"
os.chdir(path)
print("Processing files...")
for file in glob.glob("*.txt"):
    # Read file
    txt = open(os.path.join(path, file), 'r', encoding="utf8").read()
    # Remove punctuation
    for punct in string.punctuation:
        txt = txt.replace(punct, "")
    # Split into words and make lower case
    words = [item.lower() for item in txt.split()]
    # Remove uninteresting words
    words = [w for w in words if w not in noise_words_set]
    # Make a dictionary of words
    D = defaultdict(int)
    for word in words:
        D[word] += 1
    # Add to some data structure (?) that keeps count per file
    # ...word1 [1] [20] [30] [22]
    # ...word2 [5] [7] [0] [4]
Solution
Using almost your entire structure!
import glob
import string
from collections import Counter

# noise_words_set is the same set defined in the question above
files = dict()  # this may be better as a list, tbh
table = str.maketrans('', '', string.punctuation)
for file in glob.glob("*.txt"):
    with open(file, encoding="utf8") as f:
        word_count = Counter()
        for line in f:
            # split() matters here: without it the comprehension would
            # iterate over the characters of the line, not its words
            word_count += Counter(
                word.lower() for word in line.translate(table).split()
                if word.lower() not in noise_words_set
            )
    files[file] = word_count  # if list: files.append(word_count)
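At this point files maps each filename to its Counter, so per-file lookups are straightforward. For example (the filename below is hypothetical, purely for illustration):

# "report.txt" is a hypothetical filename used only for illustration
print(files["report.txt"]["word1"])        # count of "word1" in that file (0 if absent)
print(files["report.txt"].most_common(3))  # the three most frequent words in that file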
If you want them merged into a single dictionary, do this afterwards:
words_count = dict()
for word_count in files.values():  # iterate over the Counters, not the filenames
    for word, value in word_count.items():
        try:
            words_count[word].append(value)
        except KeyError:
            words_count[word] = [value]
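One caveat: the append approach above only records a count for the files in which a word actually appears, so you will not get the 0 entries from the desired output (word2 [5] [7] [0] [4]). A minimal zero-filled sketch, relying on the fact that a Counter returns 0 for missing keys (the variable names here are my own):

# Fix one column per file, then take a count for every word in every file.
ordered_files = sorted(files)             # stable column order
all_words = set().union(*files.values())  # every word seen in any file
words_count = {
    word: [files[f][word] for f in ordered_files]  # Counter returns 0 when absent
    for word in all_words
}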
This concludes this article on tracking word frequency per file with a list of dictionaries. We hope the answer above helps you solve the same problem.