Problem description
I have a list of URLs, and I'd love to read them into a dask dataframe at once, but it looks like read_csv can't use an asterisk for http. Is there any way to achieve that?
Here is an example:
link = 'http://web.mta.info/developers/'
data = [ 'data/nyct/turnstile/turnstile_170128.txt',
'data/nyct/turnstile/turnstile_170121.txt',
'data/nyct/turnstile/turnstile_170114.txt',
'data/nyct/turnstile/turnstile_170107.txt'
]
What I would like is:
df = dd.read_csv('XXXX*X')
Recommended answer
Try using dask.delayed to turn each of your URLs into a lazy pandas dataframe, and then use dask.dataframe.from_delayed to turn those lazy values into a full dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd

# Build the full URLs from the base link and the relative paths above
urls = [link + path for path in data]

# Wrap each pd.read_csv call in dask.delayed so nothing is downloaded yet
dfs = [dask.delayed(pd.read_csv)(url) for url in urls]
df = dd.from_delayed(dfs)
This will read one of your links immediately in order to figure out the metadata (columns, dtypes). If you know these dtypes and columns ahead of time, then you can avoid that eager read by passing a sample empty dataframe to dd.from_delayed(..., meta=sample_df).
See also: http://dask.pydata.org/en/latest/delayed-collections.html