Problem description
How can I get N random lines from a very large file that can't fit in memory?
It would also be great if I could add filters before or after the randomization.
In my case, the specs are:
- more than 100 million lines
- files over 10 GB
- typical random batches of 10,000-30,000 lines
- a hosted Ubuntu 14.10 server with 512 MB of RAM
So losing a few lines from the file won't be a big problem, since each line only has about a 1 in 10,000 chance of being picked anyway, but performance and resource consumption would be a problem.
Recommended answer
With such limiting factors, the following approach works better:
- seek to a random byte position in the file (you will land somewhere inside a line)
- from that position, scan backward to find the start of the current line
- scan forward to the end of the line and print the complete line
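As a quick preview, the three steps can be mimicked in bash with dd and sed. This is a toy sketch only: the temp file, the 512-byte window, and reading with dd are all illustrative, and the Perl script below is the practical version.

```shell
# create a small demo file (name and contents are illustrative)
file=$(mktemp)
seq 1 1000 > "$file"

size=$(wc -c < "$file")                      # total bytes in the file
offset=$(( RANDOM % size ))                  # step 1: a random byte position
start=$(( offset > 256 ? offset - 256 : 0 ))

# steps 2 and 3: read a window around the random position and cut out
# one complete line; the second line of the window is always whole here,
# because every demo line is far shorter than the 256 bytes read before
# the offset
window=$(dd if="$file" bs=1 skip="$start" count=512 2>/dev/null)
line=$(printf '%s\n' "$window" | sed -n '2p')
echo "$line"

rm -f "$file"
```

Note that picking the line the random byte falls into slightly favors longer lines; for batch sampling at this scale that bias is usually acceptable.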
For this you need a tool that can seek in files, for example perl:
use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY );

my $seekdiff = 256;  # read window: from rand_position-256 up to rand_position+256

my ($want, $filename) = @ARGV;

my $fd = gensym;
sysopen($fd, $filename, O_RDONLY) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek($fd, 0, SEEK_END) or die("Can't seek: $!");

my $buffer;
my $cnt = 0;
while ($want > $cnt++) {
    my $randpos = int(rand($endpos));      # random byte position in the file
    my $seekpos = $randpos - $seekdiff;    # start reading $seekdiff bytes earlier
    $seekpos = 0 if $seekpos < 0;

    sysseek($fd, $seekpos, SEEK_SET);                      # seek to that position
    my $in_count = sysread($fd, $buffer, $seekdiff << 1);  # read 2*$seekdiff bytes

    my $rand_in_buff = ($randpos - $seekpos) - 1;              # the random position within the buffer
    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1;  # find the beginning of the line in the buffer
    my $lineend   = index($buffer, "\n", $linestart);          # find the end of the line in the buffer
    my $the_line  = substr $buffer, $linestart,
                           $lineend < 0 ? 0 : $lineend - $linestart;
    print "$the_line\n";
}
Save the above into a file such as "randlines.pl" and run it as:
perl randlines.pl wanted_count_of_lines file_name
For example:
perl randlines.pl 10000 ./BIGFILE
The script uses very low-level IO operations, so it is VERY FAST (on my notebook, selecting 30k lines from 10M took half a second).
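As for the filters the question asked about, they can go on either side of the sampling stage in an ordinary pipeline. A minimal sketch, with GNU shuf -n standing in for randlines.pl so the demo is self-contained (the file, the sample size of 100, and the '^9' pattern are made up; note that unlike the seek-based script, shuf may pull the whole input into memory, so here it only shows the pipeline shape):

```shell
# a stand-in "big file" (illustrative)
seq 1 100000 > /tmp/BIGFILE

# filter AFTER sampling: fast, but yields fewer matches than requested
shuf -n 100 /tmp/BIGFILE | grep '^9' > /tmp/after

# filter BEFORE sampling: exactly 100 lines, all matching the filter
grep '^9' /tmp/BIGFILE | shuf -n 100 > /tmp/before
```

Filtering before sampling guarantees the requested count at the cost of scanning the whole file; filtering after is cheaper but shrinks the batch.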