I have two files:
- File with strings (newline-terminated)
- File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this:
while read index
do
sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands to tens of thousands of lines), since each iteration starts a new sed process that re-reads the data file.
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as best I could, but could only find people trying to print line ranges rather than indexing by a second file.
UPDATE
The index is not necessarily sorted. The lines are expected to appear in the output in the order defined by the indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2
If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR { # while processing the first file
selected[$1] = 1 # remember if an index was seen
next # and do nothing else
}
selected[FNR] # after that, select (print) the selected lines.
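As a quick check (a sketch; the file names indexfile and datafile are assumed here for the question's File 2 and File 1), this prints the matching lines in data-file order:

awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile

this is line 2
this is line 3

Note that this is not the order requested in the question's example (3 before 2); that case is handled next.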
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR { # processing the index:
++counter
idx[$0] = counter # remember that and at which position you saw
next # the index
}
FNR in idx { # when processing the data file:
lines[idx[FNR]] = $0 # remember selected lines by the position of
} # the index
END { # and at the end: print them in that order.
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter; the inlined form is shown below), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile.
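For reference, the inlined form is the same program as above, just collapsed onto one line:

awk 'NR == FNR { ++counter; idx[$0] = counter; next } FNR in idx { lines[idx[FNR]] = $0 } END { for(i = 1; i <= counter; ++i) print lines[i] }' indexfile datafile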
With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
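Against the question's original example (index entries 3 and 2), the same program prints

this is line 3
this is line 2

which is the expected output.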
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
++counter
idx[$0] = idx[$0] " " counter # remember a list here
next
}
FNR in idx {
split(idx[FNR], pos) # split that list
for(p in pos) {
lines[pos[p]] = $0 # and remember the line for
# every position in that list.
}
}
END {
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This, finally, is the functional equivalent of the code in the question. How complicated a solution you need for your use case is something you'll have to decide.
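To illustrate the duplicate handling, here is a sketch (dup.awk is a hypothetical name for a file holding the program above). With an index file containing a repeated entry

3
2
3

running awk -f dup.awk indexfile datafile against the question's data file prints

this is line 3
this is line 2
this is line 3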