Sample CSV

time,type,-1,
time,type,0,w
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,blue,font,13
time,type,11,a,12,c,15
time,type,5,r,2,s,43,t,45,u,67,style,green,font,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,11,a,12,c,15
time,type,5,r,2,s,43,t,45,u,67,style,green,font,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,yellow,font,9
time,type,19,b,12
type,19,b,42


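I want to filter each of the "type,1", "type,5", "type,11" and "type,19" rows into its own pandas DataFrame for further analysis. What is the best way to do this? [Also, I will ignore "type,0" and "type,-1".]

Sample code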

import pandas as pd

type1_header = ['type','a','b','c','name']
type5_header = ['type','r','s','t','u','style','font']
type11_header = ['type','a','c']
type19_header = ['type','b']

type1_data = pd.read_csv(file_path_to_csv, usecols=[2,4,6,8,10] , names=type1_header)
type5_data = pd.read_csv(file_path_to_csv, usecols=[2,4,6,8,10,12,14] , names=type5_header)

Best answer

import pandas as pd

headers = {1:['a','b','c','name'],
           5:['r','s','t','u','style','font'],
}

usecols = {1:[4,6,8,10],
           5:[4,6,8,10,12,14],
           }


frames = {}
for h in headers:
    frames[h] = pd.DataFrame(columns=headers[h])

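with open('irreg.csv') as f:
    for count, line in enumerate(f, start=1):
        # strip the trailing newline so it does not end up in the last field
        row = line.strip().split(',')
        try:
            ID = int(row[2])
        except (IndexError, ValueError):
            # blank line, or a row missing a leading field (e.g. "type,19,b,42")
            print('WARNING: line %d: could not read type' % count)
            continue
        if ID in frames:
            # keep only the value columns for this type
            row_subset = [row[col] for col in usecols[ID]]
            frames[ID].loc[len(frames[ID])] = row_subset
        else:
            print('WARNING: line %d: type %s not found' % (count, row[2]))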
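
As a variation on the loop above, here is a minimal sketch that does not hard-code column positions. It assumes every field after the type id comes as an alternating key,value pair, which holds for the sample rows but may not hold for your real file. The names `split_by_type`, `SAMPLE` and `IGNORED_TYPES` are made up for illustration, and the inline sample stands in for irreg.csv.

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical inline sample standing in for irreg.csv.
SAMPLE = """\
time,type,-1,
time,type,0,w
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,blue,font,13
time,type,11,a,12,c,15
time,type,19,b,12
type,19,b,42
"""

# Types the question says can be ignored.
IGNORED_TYPES = {"0", "-1"}


def split_by_type(lines):
    """Group rows by type id, reading the remaining fields as key,value pairs."""
    records = defaultdict(list)
    for line in lines:
        fields = line.strip().split(",")
        if "type" not in fields:
            continue  # blank or unrecognised line
        pos = fields.index("type")  # also works when the leading field is missing
        if pos + 1 >= len(fields):
            continue  # no type id after the "type" marker
        type_id = fields[pos + 1]
        if type_id in IGNORED_TYPES:
            continue
        rest = fields[pos + 2:]
        # Assumption: rest is key, value, key, value, ...
        records[type_id].append(dict(zip(rest[0::2], rest[1::2])))
    # One DataFrame per type; all values are still strings at this point.
    return {t: pd.DataFrame(rows) for t, rows in records.items()}


frames = split_by_type(io.StringIO(SAMPLE))
print(frames["19"])
```

Building a list of dicts per type and creating each DataFrame once at the end also avoids appending row by row with `.loc`, which tends to get slow on large files. Convert the columns afterwards (for example with `pd.to_numeric`) if you need numbers rather than strings.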


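This does the job, but how often will you run this operation, and how often does the data change? For a one-off file, it is probably easiest to split the incoming csv file, e.g. with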

 grep type,19 irreg.csv > 19.csv


on the command line, and then import each csv according to its headers and usecols.
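
If you would rather stay in Python, the same split-then-import idea might look like the sketch below. It is only an illustration: `read_type` and `SAMPLE` are hypothetical names, the inline sample stands in for irreg.csv, and rows that do not start with a leading field followed by `type` are assumed to be skippable.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for irreg.csv.
SAMPLE = """\
time,type,0,w
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,blue,font,13
time,type,11,a,12,c,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,19,b,12
"""


def read_type(lines, type_id, names, usecols):
    """Keep only the rows of one type, then let pandas parse them."""
    wanted = str(type_id)
    kept = []
    for line in lines:
        fields = line.strip().split(",")
        # Compare the whole field, so type 1 never picks up type 11 or 19.
        if len(fields) > 2 and fields[1] == "type" and fields[2] == wanted:
            kept.append(line)
    if not kept:
        return pd.DataFrame(columns=names)  # read_csv raises on empty input
    df = pd.read_csv(io.StringIO("".join(kept)), header=None, usecols=usecols)
    df.columns = names  # usecols is in ascending order, matching names
    return df


lines = SAMPLE.splitlines(keepends=True)
type1 = read_type(lines, 1, ["a", "b", "c", "name"], [4, 6, 8, 10])
type11 = read_type(lines, 11, ["a", "c"], [4, 6])
print(type1)
```

The exact field comparison matters: a plain substring search for `type,1`, whether with grep or in Python, would also match the `type,11` and `type,19` rows.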

Regarding "python - How to filter data from a combined csv of multiple data types into unique pandas dataframes?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31196760/
