Problem Description
For example, I have two CSV files,
0.csv
100a,a,b,c,c
200a,b,c,c,c
300a,c,d,c,c
and
1.csv
100a,Emma,Thomas
200a,Alex,Jason
400a,Sanjay,Gupta
500a,Nisha,Singh
and I would like the output to be like
100a,a,b,c,c,Emma,Thomas
200a,b,c,c,c,Alex,Jason
300a,c,d,c,c,0,0
400a,0,0,0,0,Sanjay,Gupta
500a,0,0,0,0,Nisha,Singh
How do I do that in a Unix shell script or Perl? I know the Unix "join" command, and that would work well with small files. For example, to get my result I could just do
join -t , -a 1 -a 2 -1 1 -2 1 -o 0,1.2,1.3,1.4,1.5,2.2,2.3 -e "0" 0.csv 1.csv
but that is not feasible for my purposes, since my actual data files have more than a million columns (total data size in the gigabytes), and thus my Unix command would also be more than a million characters long. This might be the most important headache, as inefficient code gets bogged down quite fast.
Also note that I need the placeholder character "0" whenever there is missing data. This prevents me from simply using this:
join -t , -a 1 -a 2 -1 1 -2 1 0.csv 1.csv
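(One way around the million-character `-o` list, if GNU coreutils is available: `join`'s `-o auto` option infers the output field list from the first line of each file, and `-e 0` fills the gaps with the placeholder. A minimal sketch, reproducing the sample data from this question; note that `join` still requires both inputs sorted on the key field:)

```shell
# Recreate the sample inputs from the question:
printf '100a,a,b,c,c\n200a,b,c,c,c\n300a,c,d,c,c\n' > 0.csv
printf '100a,Emma,Thomas\n200a,Alex,Jason\n400a,Sanjay,Gupta\n500a,Nisha,Singh\n' > 1.csv

# join requires both inputs sorted on the join key:
sort -t, -k1,1 0.csv > 0.sorted.csv
sort -t, -k1,1 1.csv > 1.sorted.csv

# -a1 -a2 keeps unmatched rows from both sides (a full outer join);
# -o auto takes the field count from each file's first line, so no
# explicit field list is needed; -e 0 fills missing fields with "0".
join -t, -a1 -a2 -e 0 -o auto 0.sorted.csv 1.sorted.csv
```

(This prints exactly the desired output shown above. `-o auto` is GNU-specific, so it won't work with BSD join.)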
I'm also a beginner Perl programmer, so some details would be really welcome. I'd prefer the solution to be Perl or a shell script, but really anything that works would be fine.
Recommended Answer
If you can add a header to each file, then you could use tabulator to solve the problem. For example:
0.csv:
key,letter_1,letter_2,letter_3,letter_4
100a,a,b,c,c
200a,b,c,c,c
300a,c,d,c,c
1.csv:
key,name_1,name_2
100a,Emma,Thomas
200a,Alex,Jason
400a,Sanjay,Gupta
500a,Nisha,Singh
Then
tbljoin -lr -n 0 0.csv 1.csv
produces
key,letter_1,letter_2,letter_3,letter_4,name_1,name_2
100a,a,b,c,c,Emma,Thomas
200a,b,c,c,c,Alex,Jason
300a,c,d,c,c,0,0
400a,0,0,0,0,Sanjay,Gupta
500a,0,0,0,0,Nisha,Singh
Note that (in contrast to the pure Unix join command) the input files don't need to be sorted; also, you don't need to worry about memory consumption, since the implementation is based on Unix sort and will fall back to file-based merge sort for large files.
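(The same property is worth knowing about plain GNU sort itself, in case you pre-sort the files by hand: it automatically spills to temporary files and merges them when the input exceeds its in-memory buffer. A small sketch of the relevant knobs; the tiny `big.csv` here is just a stand-in for a gigabyte-scale file:)

```shell
# Stand-in for a large unsorted CSV keyed on the first column:
printf '300a,x\n100a,y\n200a,z\n' > big.csv
mkdir -p ./sort_tmp

# -t, -k1,1     : sort on the first comma-separated field (the join key)
# --buffer-size : in-memory buffer before sort spills to disk (GNU sort)
# -T            : directory to hold the temporary merge files
sort -t, -k1,1 --buffer-size=64M -T ./sort_tmp big.csv > big.sorted.csv
```

(Pointing `-T` at a disk with enough free space matters once the spill files reach gigabytes.)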