Problem description
I regularly use dask.dataframe to read multiple files, like so:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
However, the origin of each row, i.e. which file the data was read from, seems to be lost for good.
Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows? This would be applied to each "partition" / file that is read into the dataframe when compute is triggered as part of a workflow.
The idea is that different logic can then be applied depending on the source.
Recommended answer
The Dask functions read_csv, read_table, and read_fwf now include a parameter include_path_column:
include_path_column:bool or str, optional
Whether or not to include the path to each particular file.
If True a new column is added to the dataframe called path.
If str, sets new column name. Default is False.