Question
I have a large file (5 GB) called my_file. I have a list called my_list. What is the most efficient way to read each line in the file and, if an item from my_list matches an item from a line in my_file, create a new list called matches that contains items from the lines in my_file AND items from my_list where a match occurred? Here is what I am trying to do:
def calc(my_file, my_list):
    matches = []
    my_file.seek(0, 0)
    for i in my_file:
        i = i.rstrip('\n').split('\t')
        for v in my_list:
            if v[1] == i[2]:
                item = v[0], i[1], i[3]
                matches.append(item)
    return matches
Here are a few lines from my_file:
lion 4 blue ch3
sheep 1 red pq2
frog 9 green xd7
donkey 2 aqua zr8
Here is my_list:
intel yellow
amd green
msi aqua
The desired output, a list of lists, in the above example would be:
[['amd', 9, 'xd7'], ['msi', 2, 'zr8']]
My code currently works, albeit really slowly. Would using a generator or serialization help? Thanks.
Answer
You could build a dictionary for looking up v. I added a few other small optimizations:
def calc(my_file, my_list):
    # Map each lookup key (v[1]) to its value (v[0]) for O(1) lookups.
    vd = dict((v[1], v[0]) for v in my_list)
    my_file.seek(0, 0)
    for line in my_file:
        f0, f1, f2, f3 = line[:-1].split('\t')
        v0 = vd.get(f2)
        if v0 is not None:
            yield (v0, f1, f3)
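As a minimal sketch of how this generator could be driven, using io.StringIO to stand in for the real 5 GB file and the question's sample data (tab-separated). Note the fields come back as strings, so the numeric columns are '9' and '2' rather than the ints in the desired output; convert with int() if you need integers.

```python
import io

def calc(my_file, my_list):
    # Map each lookup key (v[1]) to its value (v[0]) for O(1) lookups.
    vd = dict((v[1], v[0]) for v in my_list)
    my_file.seek(0, 0)
    for line in my_file:
        f0, f1, f2, f3 = line[:-1].split('\t')
        v0 = vd.get(f2)
        if v0 is not None:
            yield (v0, f1, f3)

# Sample data from the question, with tab-separated columns.
data = ("lion\t4\tblue\tch3\n"
        "sheep\t1\tred\tpq2\n"
        "frog\t9\tgreen\txd7\n"
        "donkey\t2\taqua\tzr8\n")
my_list = [('intel', 'yellow'), ('amd', 'green'), ('msi', 'aqua')]

# Consume the generator into a list of matches.
matches = list(calc(io.StringIO(data), my_list))
# matches == [('amd', '9', 'xd7'), ('msi', '2', 'zr8')]
```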
This should be much faster for a large my_list.
Using get is faster than checking whether i[2] is in vd and then accessing vd[i[2]].
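To illustrate that point (with a small vd built by hand from the question's sample data): the in test followed by indexing hashes the key twice when it is present, while get hashes it once either way.

```python
vd = {'yellow': 'intel', 'green': 'amd', 'aqua': 'msi'}
f2 = 'green'

# Two hash lookups when the key is present:
if f2 in vd:          # lookup 1
    vendor = vd[f2]   # lookup 2

# One hash lookup either way:
vendor = vd.get(f2)   # returns None when the key is missing
```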
For more speedup beyond these optimizations, I recommend http://www.cython.org