How to find the frequency of the keys of a dictionary across multiple text files



Problem description

I am supposed to count the frequency of every key of the dictionary "d" across all the files in the folder "individual-articles". That folder contains around 20,000 txt files, named 1.txt, 2.txt, 3.txt, ... For example, d["Britain"] = [5, 76, 289] means I must find the number of times "Britain" appears in the files 5.txt, 76.txt and 289.txt belonging to "individual-articles", and I also need its frequency across all the files in that folder. I need to store these values in another dictionary d2: for the same example, d2 must contain (Britain, 26, 1200), where 26 is the frequency of the word "Britain" in the files 5.txt, 76.txt and 289.txt, and 1200 is its frequency across all the files. I am a Python newbie and I have only tried a little! Please help!

import collections
import os
import re
import sys
from collections import Counter
from glob import glob

sys.stdout = open('dictionary.txt', 'w')   # redirect all print output into dictionary.txt

def removegarbage(text):
    # replace every run of non-word characters with a space and lowercase the text
    text = re.sub(r'\W+', ' ', text)
    return text.lower()


folderpath = 'd:/individual-articles'
counter = Counter()


filepaths = glob(os.path.join(folderpath, '*.txt'))


d2 = {}
with open('topics.txt') as f:
    d = collections.defaultdict(list)
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            d[key].append(value)

for filepath in filepaths:
    with open(filepath, 'r') as filehandle:
        lines = filehandle.read()
        words = removegarbage(lines).split()
        for k in d.keys():
            d2[k] = words.count(k)   # note: this overwrites the count with the last file's value

for i in d2.items():
    print(i)
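
As a side note, here is a minimal sketch of how the two frequencies described above could be accumulated into d2 instead of being overwritten on every pass. It reuses d, removegarbage and filepaths from the code above and assumes the numbers stored in d[key] correspond to the file names:

per_file_counts = {}                                   # filename -> Counter of the words in that file
for filepath in filepaths:
    with open(filepath, 'r') as filehandle:
        words = removegarbage(filehandle.read()).split()
        per_file_counts[os.path.basename(filepath)] = Counter(words)

d2 = {}
for key, file_ids in d.items():
    key_lc = key.lower()                               # removegarbage() lowercases every word
    in_listed = sum(per_file_counts.get('%s.txt' % fid, Counter())[key_lc] for fid in file_ids)
    in_all = sum(counts[key_lc] for counts in per_file_counts.values())
    d2[key] = (in_listed, in_all)                      # for the example above: d2['Britain'] == (26, 1200)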

Recommended answer

Well, I'm not exactly sure what you mean by "all the files in the document 'X'", but I assume it's analogous to pages in a book. With that interpretation, I would do my best to store the data in the simplest possible form. Putting the data in an easily manipulable form adds efficiency later, because you can always add a method for producing whatever type of output you want.

Since the main key you seem to be looking things up by is the keyword, I would create a nested Python dictionary with this structure:

{keyword: {file: count}}

Once it's in this form, you can perform any kind of manipulation on the data very easily.
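
As a purely hypothetical illustration (the file names and counts below are invented just to show the shape), a populated structure, including the running "total" entry that the code below maintains, might look like:

word_count_dict = {
    "britain": {"5.txt": 3, "76.txt": 1, "289.txt": 2, "total": 6},   # hypothetical example values
    "economy": {"12.txt": 4, "total": 4},
}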

To create this dictionary:

import os

# yields the words in a file one at a time
def words_generator(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word

word_count_dict = {}
for dirpath, dnames, fnames in os.walk("./"):      # point this at the folder holding the txt files
    for file in fnames:
        with open(os.path.join(dirpath, file), "r") as f:
            for word in words_generator(f):
                if word not in word_count_dict:
                    word_count_dict[word] = {"total": 0}
                if file not in word_count_dict[word]:
                    word_count_dict[word][file] = 0
                word_count_dict[word][file] += 1    # per-file count
                word_count_dict[word]["total"] += 1 # overall count across all files

This will create an easily parsable dictionary.

Want the total number of times Britain appears?

word_count_dict["Britain"]["total"]

Want the number of times Britain appears in the files 74.txt and 75.txt?

sum([word_count_dict["Britain"][file] if file in word_count_dict["Britain"] else 0 for file in ["74.txt", "75.txt"]])

Want to see all the files that the word Britain shows up in?

[key for key in word_count_dict["Britain"] if key != "total"]

You can of course write functions that perform these operations with a simple call.
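
For instance, here is a minimal sketch of such helpers, assuming the word_count_dict built above (the function names are only illustrative):

def total_count(word):
    # total occurrences of `word` across all files
    return word_count_dict.get(word, {}).get("total", 0)

def count_in_files(word, filenames):
    # occurrences of `word` restricted to the given files
    counts = word_count_dict.get(word, {})
    return sum(counts.get(fname, 0) for fname in filenames)

def files_containing(word):
    # every file in which `word` appears at least once
    return [key for key in word_count_dict.get(word, {}) if key != "total"]

print(total_count("Britain"), count_in_files("Britain", ["74.txt", "75.txt"]))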

