Unix join: return unmatched rows without losing column order

Question

I'd like to join two files that share a common identifier column, already sorted, using a Unix command-line utility such as "join". I'd like to keep unmatched rows and preserve the same column order for matched and unmatched rows, even when the identifier column is not the first column.

For example, consider two files, 1.txt and 2.txt:

1.txt

val1,val2,key
1a,1b,1
2a,2b,2
3a,3b,3

2.txt

key,val3,val4
1,1c,3d
3,3c,3d

Then, my desired output is:

key,val1,val2,val3,val4
1,1a,1b,1c,3d
2,2a,2b
3,3a,3b,3c,3d

Something like join -t, -1 3 1.txt 2.txt does what I want when limited to matched rows:

key,val1,val2,val3,val4
1,1a,1b,1c,3d
3,3a,3b,3c,3d

but it fails with unmatched rows (at least on OSX): join -a 1 -t, -1 3 1.txt 2.txt distorts the column order (notice how row 2's key is in column 3, not column 1):

key,val1,val2,val3,val4
1,1a,1b,1c,3d
2a,2b,2
3,3a,3b,3c,3d

What's the easiest way to achieve the result I'm looking for, in a Unix-like environment?

Perhaps this is a bug in join (I can't see any reason why what I'm looking for wouldn't be the preferred behavior in all cases, but I certainly could be missing something). If that's the case, I'd be happy to help fix...

Recommended answer

I believe you need to specify the output columns to get the result you desire:

$ join -a 1 -t, -1 3 -o 0,1.1,1.2,2.2,2.3 1.txt 2.txt
key,val1,val2,val3,val4
1,1a,1b,1c,3d
2,2a,2b,,
3,3a,3b,3c,3d
$

-o 0 is the join column; the others are file.field numbers. Note that it includes empty fields for the missing values (the double ,, at the end). If that's a major problem, you can obviously delete trailing (repeated) commas, and a little less obviously delete all but one of repeated commas in the middle of an output line. I'd feed the output through sed to do that.
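As a minimal sketch of the sed clean-up mentioned above: the snippet below does not call join itself. It reuses the join output shown in this answer as literal sample input, so it illustrates only the post-processing step. The variable names (joined, trimmed, squeezed) and the 'a,,b,,' sample line are made up for illustration.

```shell
# Output of the join command above, reproduced verbatim as sample input.
joined='key,val1,val2,val3,val4
1,1a,1b,1c,3d
2,2a,2b,,
3,3a,3b,3c,3d'

# Delete any run of commas at the end of a line (the empty trailing fields).
trimmed=$(printf '%s\n' "$joined" | sed 's/,*$//')
printf '%s\n' "$trimmed"

# Hypothetical sample line with an empty field in the middle as well:
# strip trailing commas first, then squeeze each remaining run to one comma.
squeezed=$(printf '%s\n' 'a,,b,,' | sed -e 's/,*$//' -e 's/,,*/,/g')
printf '%s\n' "$squeezed"
```

Squeezing commas in the middle of a line shifts the later fields to the left, so it is probably only appropriate when the consumer of the output does not rely on column positions.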

Tested on Mac OS X 10.11.4 with both the BSD (/usr/bin/join) and GNU (home-built; it happens to be at /opt/gnu/bin/join) versions of join.
