This article describes an approach to efficiently splitting a large file (currently using AWK). It should be a useful reference for anyone facing a similar problem.

Problem description

I have a 4 GB file that I need to do some operations on. I have a Bash script to do this, but Bash seems ill suited to reading large data files into an array. So I decided to break up the file with awk.

My current script is:

for((i=0; i<100; i++)); do awk -v i=$i 'BEGIN{binsize=60000}{if(binsize*i < NR && NR <= binsize*(i+1)){print}}END{}' my_large_file.txt &> my_large_file_split$i.fastq; done

However, the problem with this script is that it reads and loops through the large file 100 times (which presumably leads to roughly 400 GB of I/O).

QUESTION: Is there a better strategy for reading the large file only once? Perhaps writing to the output files from within awk instead of redirecting its output?

Recommended answer

Assuming binsize is the number of lines you want per chunk, you could simply maintain and reset a line counter as you step through the file, and set alternate output files within awk instead of using the shell to redirect.

awk -v binsize=60000 '
  BEGIN {
    # start with the first chunk
    filenum=1
    outfile="output_chunk_1.txt"
  }
  count >= binsize {
    # the current chunk is full: close it and switch to the next output file
    close(outfile)
    filenum++
    outfile="output_chunk_" filenum ".txt"
    count=0
  }
  {
    count++
    print > outfile
  }
' my_large_file.txt

I haven't actually tested this code, so if it doesn't work verbatim, at least it should give you an idea of a strategy to use. :-)

The idea is that we'll step through the file, updating the filename in a variable whenever our line count for a chunk reaches binsize. Note that the close(outfile) isn't strictly necessary, as awk will of course close any open files when it exits, but it may save you a few bytes of memory per open file handle (which will only be significant if you have a great many output files).
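
As a quick sanity check (not part of the original answer), you could compare the total line count of the chunks against the source file, assuming the output_chunk_ naming used above:

# lines in the original file
wc -l < my_large_file.txt

# total lines across all chunks -- the two numbers should match
cat output_chunk_*.txt | wc -l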

That said, you could do almost exactly the same thing in bash alone:

#!/usr/bin/env bash

binsize=60000

filenum=1; count=0

# IFS= with read -r preserves each line exactly as it appears in the file
while IFS= read -r line; do

  if [ "$count" -ge "$binsize" ]; then
    # the current chunk is full: move on to the next output file
    ((filenum++))
    count=0
  fi

  ((count++))

  outfile="output_chunk_${filenum}.txt"
  printf '%s\n' "$line" >> "$outfile"

done < my_large_file.txt

(Also untested.)

And while I'd expect the awk solution to be faster than bash, it might not hurt to do your own benchmarks. :)
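
If you do want to benchmark, a minimal sketch is to time each approach with the shell's time keyword; split_with_awk.sh and split_with_bash.sh below are just placeholder names for scripts containing the awk command and the bash loop shown above:

# rough wall-clock comparison of the two approaches
time bash split_with_awk.sh
time bash split_with_bash.sh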

That concludes this article on efficiently splitting a large file (currently using AWK); we hope the recommended answer above proves helpful.
