This article describes how to add a value to a column of a Dask DataFrame imported with read_csv.

Problem description

Suppose that five files are imported into Dask using read_csv. To do this, I use the following code:

import dask.dataframe as dd
data = dd.read_csv(final_file_list_msg, header = None)

Every file has ten columns. I want to add 1 to the first column of file 1, 2 to the first column of file 2, 3 to the first column of file 3, and so on.

Recommended answer

Let's assume that you have several files following this scheme:

dummy/
├── file01.csv
├── file02.csv
├── file03.csv

First we create them with:

import os
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask import delayed

fldr = "dummy"

if not os.path.exists(fldr):
    os.mkdir(fldr)

for i in range(10):
    df = pd.DataFrame(np.random.rand(5,3))
    df.to_csv("{}/file{:02}.csv".format(fldr,i+1),
              index=False)

The list of created files is fns = sorted(os.listdir(fldr))

Then we write a function that, given the file name fn:

  • reads the file
  • takes the number XX from fileXX.csv
  • inserts int(XX) as the first column

That is:

def addCol(fn):
    df = pd.read_csv(os.path.join(fldr, fn))
    first = int(fn.split(".")[0][-2:])
    df.insert(0, "first", first)
    return df
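The slice [-2:] in addCol assumes that every file name ends in exactly two digits. As a minimal sketch of a more tolerant variant, the number could be pulled out with a regular expression instead. The helper name file_number and the sample file names below are hypothetical and are not part of the original answer.

```python
import re


def file_number(fn):
    """Return the trailing integer in a name such as 'file07.csv' (hypothetical helper)."""
    # Match one or more digits immediately before the '.csv' extension.
    match = re.search(r"(\d+)\.csv$", fn)
    if match is None:
        raise ValueError("no trailing number in file name: {!r}".format(fn))
    return int(match.group(1))


# Sample names for illustration only.
numbers = [file_number(fn) for fn in ["file01.csv", "file02.csv", "file10.csv"]]
print(numbers)
```

Inside addCol, first = file_number(fn) could then stand in for the slicing expression if the names are not guaranteed to carry two digits.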

We want this function to be delayed, which we can achieve either by using the decorator @delayed or by wrapping the function with delayed. So to obtain the desired output we should call one of the following, accordingly:

  • ddf = dd.from_delayed([addCol(fn) for fn in fns]) if addCol is decorated with @delayed
  • ddf = dd.from_delayed([delayed(addCol)(fn) for fn in fns]) if addCol is left undecorated


11-01 08:38