This post covers what to do when reading an Excel file with Python's openpyxl is too slow.

Problem description

I have an .xlsx file with 1841 rows, each with 30 columns. I use openpyxl to read all of the data into a list.

from openpyxl import load_workbook

def get_value(i, ws, article_row):
    # Read a single cell, then advance the column index.
    value = ws.cell(row=article_row, column=i).value
    i += 1
    return value, i

def geneList(f):
    wb = load_workbook(filename=f, read_only=True)
    ws = wb.worksheets[0]
    max_row = ws.max_row
    info_list = []
    for ar_row in range(2, max_row + 1):
        i = 3  # start from the 3rd column: sourceid
        sourceID, i = get_value(i, ws, ar_row)
        pv, i = get_value(i, ws, ar_row)
        pubtype, i = get_value(i, ws, ar_row)
        # ...... (remaining columns elided in the original post)
        item_dict = {'sourceID': sourceID, 'pv': pv, 'pubtype': pubtype,
                     # ...
                     }
        info_list.append(item_dict)  # append once per row, inside the loop
    return info_list

But it takes around 20 minutes to read all the information. Parsing is too slow, and I do not know how to fix it. Should I convert the .xlsx to .csv, or use a different tool to read it?

Recommended answer

The problem comes from the heavy use of the cell method in read-only mode. In read-only mode, openpyxl reads the relevant worksheet on demand to keep memory use low, but this means the XML is parsed again for every lookup. Your code forces openpyxl to reparse the file for every cell in every row, which is obviously slow. It is also entirely unnecessary, because there is an API for row-based access. Just use ws.iter_rows() with the relevant bounds (min_row, max_row, min_col, max_col) to get the cells you want.
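As a rough illustration of the row-based approach, here is a minimal sketch rather than a definitive implementation. It assumes row 1 is a header and that the data starts at the 3rd column, as in the question. Only the three field names shown in the question are used, because the remaining columns are elided there. The names FIELDS, rows_to_dicts and gene_list are hypothetical, and values_only=True assumes openpyxl 2.6 or later.

```python
# Field names taken from the question; the other columns are elided there,
# so extend this tuple to match the real sheet (assumption).
FIELDS = ("sourceID", "pv", "pubtype")
FIRST_ROW = 2  # assumes row 1 is a header
FIRST_COL = 3  # the question starts at the 3rd column


def rows_to_dicts(ws, fields=FIELDS, first_row=FIRST_ROW, first_col=FIRST_COL):
    """Stream the worksheet once and build one dict per row."""
    last_col = first_col + len(fields) - 1
    rows = ws.iter_rows(min_row=first_row, min_col=first_col,
                        max_col=last_col, values_only=True)
    # Each row arrives as a tuple of plain values, in column order.
    return [dict(zip(fields, row)) for row in rows]


def gene_list(path):
    # Imported here so rows_to_dicts can be used with any worksheet-like object.
    from openpyxl import load_workbook

    wb = load_workbook(filename=path, read_only=True)
    try:
        return rows_to_dicts(wb.worksheets[0])
    finally:
        wb.close()  # read-only workbooks keep the file open until closed
```

Calling gene_list("data.xlsx") (a placeholder file name) walks the sheet in a single pass instead of issuing one ws.cell() lookup per cell, which is the pattern the answer identifies as the cause of the slowdown.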


08-04 11:49