How to create a Dask DataFrame from a list of URLs?

Question

I have a list of URLs, and I'd love to read them all into a dask dataframe at once, but it looks like read_csv can't use an asterisk for http. Is there any way to achieve that?

Here is an example:

link = 'http://web.mta.info/developers/'

data = [
    'data/nyct/turnstile/turnstile_170128.txt',
    'data/nyct/turnstile/turnstile_170121.txt',
    'data/nyct/turnstile/turnstile_170114.txt',
    'data/nyct/turnstile/turnstile_170107.txt',
]

What I would like is something like:

df = dd.read_csv('XXXX*X')

Recommended answer

Try using dask.delayed to turn each of your URLs into a lazy pandas dataframe, and then use dask.dataframe.from_delayed to turn those lazy values into a full dask dataframe:

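import pandas as pd
import dask
import dask.dataframe as dd

# Build the full URLs from the base link and relative paths in the question
urls = [link + path for path in data]

# One lazy pandas dataframe per URL
dfs = [dask.delayed(pd.read_csv)(url) for url in urls]

# Combine the lazy pieces into a single dask dataframe
df = dd.from_delayed(dfs)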

This will read one of your links immediately in order to figure out the metadata (columns, dtypes). If you know the columns and dtypes ahead of time, you can avoid this by passing a sample empty dataframe to dd.from_delayed(..., meta=sample_df).

See also: http://dask.pydata.org/en/latest/delayed-collections.html

