Dask dataframes: reading multiple files and storing the file name in a column

Question

I regularly use dask.dataframe to read multiple files, like so:

import dask.dataframe as dd

df = dd.read_csv('*.csv')

However, the origin of each row, i.e. which file the data was read from, seems to be lost for good.

Is there a way to add this as a column? For example, df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows. This would be applied to each "partition" / file that is read into the dataframe when compute is triggered as part of a workflow.

The idea is that different logic can then be applied depending on the source.

Recommended answer

The Dask functions read_csv, read_table, and read_fwf now include a parameter include_path_column:

include_path_column : bool or str, optional
    Whether or not to include the path to each particular file.
    If True a new column is added to the dataframe called path.
    If str, sets new column name. Default is False.
