Problem description
I am attempting to merge two CSV files based on a specific field in each file.
file1.csv
id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"
file2.csv
id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False
This is the code I am using:
import csv
from collections import OrderedDict

with open('file2.csv', 'r') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader, None)  # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}

with open('file1.csv', 'r') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader, None)  # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)

result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)

with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)
I get output like this, which merges appropriately, but does not have the same number of attributes for all rows:
1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure
file2 will not have a record for every id in file1. I'd like the output to have empty fields from file2 in the merged file. For example, id 1 would look like this:
1,True,7,Purple,,,
How can I add the empty fields to records that don't have data in file2 so that all of my records in the merged CSV have the same number of attributes?
If we're not using pandas, I'd refactor to something like
import csv
from collections import OrderedDict

filenames = "file1.csv", "file2.csv"
data = OrderedDict()  # id -> merged row, in first-seen order
fieldnames = []
for filename in filenames:
    with open(filename, "rb") as fp:  # python 2
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            # Rows sharing an id are folded into one dict
            data.setdefault(row["id"], {}).update(row)

# Remove the duplicated "id" column while keeping column order
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.itervalues():
        # Columns missing from a row are written as empty strings
        writer.writerow([row.get(field, '') for field in fieldnames])
which gives
id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
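The code above is Python 2 (binary file modes, itervalues). On Python 3 the same idea could be sketched roughly as below. This is only a port under some assumptions: the function name merge_csv_on_key and its key parameter are made up for illustration, and it uses csv.DictWriter with restval="" to fill the gaps instead of the row.get(field, '') list comprehension.

```python
import csv


def merge_csv_on_key(filenames, out_path, key="id"):
    """Merge CSV files on `key` (hypothetical helper, not from the original answer)."""
    data = {}        # key value -> merged row dict; dicts keep insertion order (3.7+)
    fieldnames = []
    for filename in filenames:
        # Python 3: open in text mode with newline="" instead of Python 2's "rb"
        with open(filename, newline="") as fp:
            reader = csv.DictReader(fp)
            fieldnames.extend(reader.fieldnames or [])
            for row in reader:
                data.setdefault(row[key], {}).update(row)

    # Drop the repeated key column while preserving column order
    fieldnames = list(dict.fromkeys(fieldnames))

    with open(out_path, "w", newline="") as fp:
        # restval="" is written for every column a row has no value for
        writer = csv.DictWriter(fp, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(data.values())


# Example call, assuming the two files from the question exist:
# merge_csv_on_key(["file1.csv", "file2.csv"], "merged.csv")
```

As in the Python 2 version, an id that only appears in a later file still gets a row, with blanks in the earlier file's columns.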
For comparison, the pandas equivalent would be something like
import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
# Outer join on id; ids missing from one file get empty strings
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)
which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.
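If you'd rather keep the shape of the original two-dictionary code, a smaller change is to pad each file1 row that has no match with one empty string per file2 attribute. The sketch below is in Python 3 and rests on a few assumptions: the name merge_with_padding is hypothetical, the id is the first column in both files, and every id in file2 also appears in file1 (ids found only in file2 are dropped). Unlike the original code, it also writes a header row.

```python
import csv


def merge_with_padding(path1, path2, out_path):
    """Append path2's attributes to path1's rows, padding gaps with ''.

    Hypothetical helper; assumes the id is the first column of both files.
    """
    with open(path2, newline="") as f2:
        reader = csv.reader(f2)
        fields2 = next(reader, None) or []   # header row of the second file
        dict2 = {row[0]: row[1:] for row in reader if row}

    # One empty string per non-id column of the second file
    blanks = [""] * max(len(fields2) - 1, 0)

    with open(path1, newline="") as f1, open(out_path, "w", newline="") as out:
        reader = csv.reader(f1)
        fields1 = next(reader, None) or []   # header row of the first file
        writer = csv.writer(out)
        writer.writerow(fields1 + fields2[1:])
        for row in reader:
            if not row:
                continue                      # skip blank lines
            # Use the matching attributes if present, otherwise the blanks
            writer.writerow(row + dict2.get(row[0], blanks))


# Example call, assuming the two files from the question exist:
# merge_with_padding("file1.csv", "file2.csv", "merged.csv")
```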