Problem description
I have a 4 GB file that I need to do some operations on. I have a Bash script to do this, but Bash seems ill suited to reading large data files into an array. So I decided to break up my file with awk.
The script I currently have is:
for((i=0; i<100; i++)); do awk -v i=$i 'BEGIN{binsize=60000}{if(binsize*i < NR && NR <= binsize*(i+1)){print}}END{}' my_large_file.txt &> my_large_file_split$i.fastq; done
However the problem with this script is that it will read in and loop through this large file 100 times (which presumably will lead to about 400GB of IO).
QUESTION: Is there a better strategy for reading the large file only once? Perhaps doing the writing to files within awk instead of redirecting its output?
Answer
Assuming binsize is the number of lines you want per chunk, you could simply maintain and reset a line counter as you step through the file, and switch to alternate output files within awk instead of using the shell to redirect.
awk -v binsize=60000 '
BEGIN {
    filenum = 1
    outfile = "output_chunk_1.txt"
}
count >= binsize {
    # The current chunk is full: close it and switch to the next output file.
    close(outfile)
    filenum++
    outfile = "output_chunk_" filenum ".txt"
    count = 0
}
{
    # Count the line and write it to the currently active chunk.
    count++
    print > outfile
}
' my_large_file.txt
I haven't actually tested this code, so if it doesn't work verbatim, at least it should give you an idea of a strategy to use. :-)
The idea is that we step through the file, updating the output filename held in a variable whenever the line count for the current chunk reaches binsize. Note that the close(outfile) isn't strictly necessary, as awk will of course close any open files when it exits, but it may save you a few bytes of memory per open file handle (which will only be significant if you have very many output files).
That said, you could do almost exactly the same thing in bash alone:
#!/usr/bin/env bash
binsize=60000
filenum=1
count=0
# Note: this appends (>>), so remove any existing output_chunk_*.txt before rerunning.
while IFS= read -r line; do
    if [ "$count" -ge "$binsize" ]; then
        # The current chunk is full: move on to the next output file.
        ((filenum++))
        count=0
    fi
    ((count++))
    outfile="output_chunk_${filenum}.txt"
    printf '%s\n' "$line" >> "$outfile"
done < my_large_file.txt
(Also untested.)
And while I'd expect the awk solution to be faster than bash, it might not hurt to do your own benchmarks. :)
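If you do benchmark, one simple approach is to time each run on the same input and clear out the chunks in between. This sketch assumes you save the awk program and the bash script above to files first; split_chunks.awk and split_chunks.sh are hypothetical names:

rm -f output_chunk_*.txt
time awk -v binsize=60000 -f split_chunks.awk my_large_file.txt   # awk program above, saved as split_chunks.awk (hypothetical name)
rm -f output_chunk_*.txt
time bash split_chunks.sh                                         # bash script above, saved as split_chunks.sh (hypothetical name)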