Appending blank rows to a DataFrame if a column doesn't exist

Question

This question is a bit odd and complex, so please bear with me.

I have several massive CSV files (gigabytes in size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to keep only the relevant columns. The issue is that not all of the CSV files have all of the columns I need (a property of the data system being used).

The problem is that, if a column doesn't exist in the file but is specified in usecols, read_csv throws an error.
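A minimal sketch reproducing the failure, assuming a small in-memory CSV in place of a real dump (the column names a, b and c are made up for illustration). In the pandas versions I am aware of, the error is a ValueError naming the missing column.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the dumps: it has columns a and b, but no c.
csv_text = "a,b\n1,2\n3,4\n"

try:
    pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"])
    error = None
except ValueError as exc:
    # pandas complains that a requested column was not found in the file.
    error = exc

print(type(error).__name__, error)
```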

Is there a straightforward way to force a specified set of columns in a dataframe and have pandas just return blank values for a column that doesn't exist in the file? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.

Recommended answer

Assuming some kind of master list all_cols_to_use, you could do something like this:

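import pandas as pd

def parse_big_csv(csvpath):
    # all_cols_to_use is the master list of wanted column names, defined elsewhere.
    # Read only the header row to see which columns this file actually has.
    header = pd.read_csv(csvpath, nrows=0).columns
    cols_to_use = [col for col in all_cols_to_use if col in header]
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # Add each missing column, filled with NaN, and put columns in master-list order.
    return df.reindex(columns=all_cols_to_use)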

This assumes that you're okay with the missing columns being filled with np.nan, but it should work. (Also, if you're concatenating the data frames, the missing columns will be present in the final df, filled with np.nan where appropriate.)
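As a sketch of the concatenation point, the snippet below loads two small in-memory CSVs (made-up data and column names) and stacks them with pd.concat. The helper load_with_all_cols is hypothetical and is a variation on the approach above, not the same code: it swaps the header-reading step for a callable usecols, which read_csv evaluates against each column name, and then uses reindex to add any missing columns as NaN.

```python
import io
import pandas as pd

# Hypothetical master list of wanted columns.
all_cols_to_use = ["time", "temp", "pressure"]

def load_with_all_cols(source):
    # A callable usecols keeps only the wanted columns that actually exist,
    # so a column missing from the file does not raise an error.
    df = pd.read_csv(source, usecols=lambda name: name in all_cols_to_use)
    # reindex adds each absent column, filled with NaN, in a fixed order.
    return df.reindex(columns=all_cols_to_use)

# Made-up stand-ins for two dumps; the second lacks "pressure".
file_a = io.StringIO("time,temp,pressure,unused\n0,20.5,101.3,x\n1,20.7,101.2,y\n")
file_b = io.StringIO("time,temp,unused\n2,21.0,z\n")

combined = pd.concat(
    [load_with_all_cols(f) for f in (file_a, file_b)],
    ignore_index=True,
)
print(combined)
```

The callable form avoids opening each file twice, which may matter for multi-gigabyte dumps, though whether it is faster in practice would need measuring.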


08-23 00:47