Dask dataframes: storing the filename in a column when reading multiple files


Question

I regularly use dask.dataframe to read multiple files, like so:

import dask.dataframe as dd

df = dd.read_csv('*.csv')

However, the origin of each row, i.e. which file the data was read from, seems to be lost for good.

Is there a way to add this as a column? For example, df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows. This would be applied to each "partition" (file) that is read into the dataframe when compute is triggered as part of a workflow.

The idea is that different logic can then be applied depending on the source.

Recommended answer

The Dask functions read_csv, read_table, and read_fwf now accept an include_path_column parameter:

include_path_column : bool or str, optional
    Whether or not to include the path to each particular file.
    If True, a new column called path is added to the dataframe.
    If str, sets the new column name. Default is False.
