Problem Description
TL;DR: How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection?
I currently have a proprietary file format I'm using to feed into dask.DataFrame. I have a function that accepts a file path and returns a pandas.DataFrame, which dask.DataFrame uses internally and successfully to load multiple files into the same dask.DataFrame.
Up until recently, I was using my own code to merge several pandas.DataFrames into one, and now I'm working on using dask instead. When parsing the file format, I may encounter errors and certain conditions that I want to log and associate with the dask.DataFrame object as metadata (logs, origin of data, etc.).
It's important to note that, when reasonable, I'm using MultiIndexes quite heavily (13 index levels, 3 column levels). For metadata that describes the entire dataframe rather than specific rows, I'm using attributes.
Using a custom function, I could pass the metadata in a tuple with the actual DataFrame. Using pandas, I could add it to the _metadata field and as attributes on the DataFrame objects. How can I collect metadata from separate pandas.DataFrame objects when using the dask framework?
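For context, here is a minimal sketch of the pre-dask workflow described above; the `parse_with_meta` helper is a hypothetical stand-in for the proprietary parser:

```python
import pandas as pd

def parse_with_meta(path):
    # Hypothetical stand-in for the proprietary parser: returns the
    # parsed frame together with a metadata dict (logs, provenance, ...).
    df = pd.DataFrame({"value": [1, 2, 3]})
    meta = {"origin": path, "errors": []}
    return df, meta

frames, metas = [], []
for path in ["file_a", "file_b"]:
    df, meta = parse_with_meta(path)
    frames.append(df)
    metas.append(meta)

combined = pd.concat(frames)  # the manual merge that dask should replace
```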
Thanks!
Recommended Answer
There are a few potential questions here:
- Q: How do I load data from many files in a custom format into a single dask dataframe?

A: You might check out dask.delayed to load the data and dask.dataframe.from_delayed to convert several dask Delayed objects into a single dask dataframe. Or, as you're probably doing now, you can use dask.dataframe.from_pandas and dask.dataframe.concat. See this example notebook on using dask.delayed with custom objects/functions; a minimal sketch follows below.
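A minimal sketch of the dask.delayed route, assuming a hypothetical `parse_file` function in place of the proprietary-format reader:

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

def parse_file(path):
    # Hypothetical parser: path -> pandas.DataFrame, standing in for the
    # proprietary-format reader mentioned in the question.
    return pd.DataFrame({"value": [1, 2, 3]})

paths = ["file_a", "file_b"]  # hypothetical list of input files
parts = [delayed(parse_file)(p) for p in paths]
ddf = dd.from_delayed(parts)  # one dask dataframe, one partition per file
```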
- Q: How do I store arbitrary metadata in a dask.dataframe?
A: This is not supported. Generally, I recommend using a different data structure to store your metadata if possible. If there are a number of use cases for this, then we should consider adding it to dask dataframe; if that is the case, please raise an issue. Generally though, it would be good to see better support for this in Pandas before dask.dataframe considers supporting it. One workaround is sketched below.
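One common workaround is to keep the metadata in a separate structure that travels through the same graph. A minimal sketch, again assuming a hypothetical `parse_with_meta` function that returns a (DataFrame, dict) pair:

```python
import dask
import dask.dataframe as dd
import pandas as pd

def parse_with_meta(path):
    # Hypothetical parser returning (DataFrame, metadata dict).
    df = pd.DataFrame({"value": [1, 2, 3]})
    meta = {"origin": path, "errors": []}
    return df, meta

# nout=2 makes each call yield two Delayed objects: the frame and its metadata
parse = dask.delayed(parse_with_meta, nout=2)

paths = ["file_a", "file_b"]
parts, metas = zip(*(parse(p) for p in paths))

ddf = dd.from_delayed(list(parts))  # the dataframe collection
metadata = dask.compute(*metas)     # tuple of per-file metadata dicts
```

This keeps the dataframe collection clean while the per-file metadata is collected in an ordinary Python structure alongside it.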
- Q: I use MultiIndexes heavily in Pandas; how do I integrate that workflow into dask.dataframe?