Problem Description
For example, I have two CSV files,
0.csv
100a,a,b,c,c
200a,b,c,c,c
300a,c,d,c,c
and
1.csv
100a,Emma,Thomas
200a,Alex,Jason
400a,Sanjay,Gupta
500a,Nisha,Singh
and I would like the output to be like
100a,a,b,c,c,Emma,Thomas
200a,b,c,c,c,Alex,Jason
300a,c,d,c,c,0,0
400a,0,0,0,0,Sanjay,Gupta
500a,0,0,0,0,Nisha,Singh
How do I do that in a Unix shell script or Perl? I know the Unix "join" command, and that would work well with small files. For example, to get my result I could just do
join -t , -a 1 -a 2 -1 1 -2 1 -o 0,1.2,1.3,1.4,1.5,2.2,2.3 -e "0" 0.csv 1.csv
but that is not feasible for my purposes, since my actual data files have more than a million columns (total data size in the gigabytes), and thus my Unix command would also be more than a million characters long. This might be the most important headache, as inefficient code gets bogged down quite fast.
Also note that I need the placeholder character "0" whenever there is missing data. This prevents me from simply using this:
join -t , -a 1 -a 2 -1 1 -2 1 0.csv 1.csv
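(One way around the million-character `-o` list, if GNU coreutils is available: `join`'s `-o auto` option infers the output field list from the first line of each file, and `-e 0` fills the gaps with the placeholder. A minimal sketch, reproducing the sample data from this question; note that `join` still requires both inputs sorted on the key field:)

```shell
# Recreate the sample inputs from the question:
printf '100a,a,b,c,c\n200a,b,c,c,c\n300a,c,d,c,c\n' > 0.csv
printf '100a,Emma,Thomas\n200a,Alex,Jason\n400a,Sanjay,Gupta\n500a,Nisha,Singh\n' > 1.csv

# join requires both inputs sorted on the join key:
sort -t, -k1,1 0.csv > 0.sorted.csv
sort -t, -k1,1 1.csv > 1.sorted.csv

# -a1 -a2 keeps unmatched rows from both sides (a full outer join);
# -o auto takes the field count from each file's first line, so no
# explicit field list is needed; -e 0 fills missing fields with "0".
join -t, -a1 -a2 -e 0 -o auto 0.sorted.csv 1.sorted.csv
```

(This prints exactly the desired output shown above. `-o auto` is GNU-specific, so it won't work with BSD join.)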
I'm also a beginner Perl programmer, so some details would be really welcome. I'd prefer the solution to be Perl or a shell script, but really anything that works would be fine.
Recommended Answer
If you can add a header to each file, then you could use tabulator to solve the problem. For example:
0.csv:
key,letter_1,letter_2,letter_3,letter_4
100a,a,b,c,c
200a,b,c,c,c
300a,c,d,c,c
1.csv:
key,name_1,name_2
100a,Emma,Thomas
200a,Alex,Jason
400a,Sanjay,Gupta
500a,Nisha,Singh
Then
tbljoin -lr -n 0 0.csv 1.csv
produces
key,letter_1,letter_2,letter_3,letter_4,name_1,name_2
100a,a,b,c,c,Emma,Thomas
200a,b,c,c,c,Alex,Jason
300a,c,d,c,c,0,0
400a,0,0,0,0,Sanjay,Gupta
500a,0,0,0,0,Nisha,Singh
Note that (in contrast to the pure Unix join command) the input files don't need to be sorted; also, you don't need to worry about memory consumption, since the implementation is based on Unix sort and will fall back to file-based merge sort for large files.
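(The same property is worth knowing about plain GNU sort itself, in case you pre-sort the files by hand: it automatically spills to temporary files and merges them when the input exceeds its in-memory buffer. A small sketch of the relevant knobs; the tiny `big.csv` here is just a stand-in for a gigabyte-scale file:)

```shell
# Stand-in for a large unsorted CSV keyed on the first column:
printf '300a,x\n100a,y\n200a,z\n' > big.csv
mkdir -p ./sort_tmp

# -t, -k1,1     : sort on the first comma-separated field (the join key)
# --buffer-size : in-memory buffer before sort spills to disk (GNU sort)
# -T            : directory to hold the temporary merge files
sort -t, -k1,1 --buffer-size=64M -T ./sort_tmp big.csv > big.sorted.csv
```

(Pointing `-T` at a disk with enough free space matters once the spill files reach gigabytes.)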