Problem description
I have a 4 GB file that I need to do some operations on. I have a Bash script to do this, but Bash seems ill suited to reading large data files into an array. So I decided to break up my file with awk.
The script I currently have is:
for((i=0; i<100; i++)); do awk -v i=$i 'BEGIN{binsize=60000}{if(binsize*i < NR && NR <= binsize*(i+1)){print}}END{}' my_large_file.txt &> my_large_file_split$i.fastq; done
However the problem with this script is that it will read in and loop through this large file 100 times (which presumably will lead to about 400GB of IO).
QUESTION: Is there a better strategy for reading the large file only once? Perhaps doing the writing to files within awk instead of redirecting its output?
Answer
Assuming binsize is the number of lines you want per chunk, you could simply maintain and reset a line counter as you step through the file, and switch to alternate output files within awk instead of using the shell to redirect.
awk -v binsize=60000 '
BEGIN {
    filenum = 1
    outfile = "output_chunk_1.txt"
}
count >= binsize {
    # The current chunk is full: close it and switch to the next output file.
    close(outfile)
    filenum++
    outfile = "output_chunk_" filenum ".txt"
    count = 0
}
{
    # Count the line and write it to the currently active chunk.
    count++
    print > outfile
}
' my_large_file.txt
I haven't actually tested this code, so if it doesn't work verbatim, at least it should give you an idea of a strategy to use. :-)
The idea is that we step through the file, updating the output filename held in a variable whenever the line count for the current chunk reaches binsize. Note that the close(outfile) isn't strictly necessary, as awk will of course close any open files when it exits, but it may save you a few bytes of memory per open file handle (which will only be significant if you have very many output files).
That said, you could do almost exactly the same thing in bash alone:
#!/usr/bin/env bash
binsize=60000
filenum=1
count=0
# Note: this appends (>>), so remove any existing output_chunk_*.txt before rerunning.
while IFS= read -r line; do
    if [ "$count" -ge "$binsize" ]; then
        # The current chunk is full: move on to the next output file.
        ((filenum++))
        count=0
    fi
    ((count++))
    outfile="output_chunk_${filenum}.txt"
    printf '%s\n' "$line" >> "$outfile"
done < my_large_file.txt
(Also untested.)
And while I'd expect the awk solution to be faster than bash, it might not hurt to do your own benchmarks. :)
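If you do benchmark, one simple approach is to time each run on the same input and clear out the chunks in between. This sketch assumes you save the awk program and the bash script above to files first; split_chunks.awk and split_chunks.sh are hypothetical names:

rm -f output_chunk_*.txt
time awk -v binsize=60000 -f split_chunks.awk my_large_file.txt   # awk program above, saved as split_chunks.awk (hypothetical name)
rm -f output_chunk_*.txt
time bash split_chunks.sh                                         # bash script above, saved as split_chunks.sh (hypothetical name)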