This article walks through generalizing a (specific) script built around a lengthy awk operation involving 100+ columns. The question and answer below should be a useful reference for anyone facing a similar problem.

Problem description

Please just glance at the abundance of manual input in the code below; there is no need to understand it:

#!/bin/bash

paste A1.dat A2.dat A3.dat A4.dat A5.dat A6.dat > A.dat

awk '{print ($2 + $21 + $40 + $59 + $78 + $97), ($3 + $22 + $41 + $60 + $79 + $98), ($4 + $23 + $42 + $61 + $80 + $99) + ($6 + $25 + $44 + $63 + $82 + $101) + ($8 + $27 + $46 + $65 + $84 + $103), ($5 + $24 + $43 + $62 + $81 + $100) + ($7 + $26 + $45 + $64 + $83 + $102) + ($9 + $28 + $47 + $66 + $85 + $104), ($10 + $29 + $48 + $67 + $86 + $105) + ($12 + $31 + $50 + $69 + $88 + $107) + ($14 + $33 + $52 + $71 + $90 + $109) + ($16 + $35 + $54 + $73 + $92 + $111) + ($18 + $37 + $56 + $75 + $94 + $113), ($11 + $30 + $49 + $68 + $87 + $106) + ($13 + $32 + $51 + $70 + $89 + $108) + ($15 + $34 + $53 + $72 + $91 + $110) + ($17 + $36 + $55 + $74 + $93 + $112) + ($19 + $38 + $57 + $76 + $95 + $114)}' A.dat >> A_full.dat

Code objective: take data stored in n input files, each containing 19 columns of data and an equal number of rows, and manipulate it to generate an output file with 7 columns of data and the same number of rows as each of the input files.

What I did in the code above: used paste to merge all n input files (A?.dat) into one file (A.dat), then used awk to manipulate the data in A.dat to produce the output file (A_full.dat). This becomes unruly and cumbersome for large values of n.

My request: help me generalize the code for any value of n. The code posted above is for n=6. To understand what data manipulation the code performs, look at the code below for n=2 (see the explanation after the sample files):

#!/bin/bash

paste A1.dat A2.dat > A.dat

awk '{print $1, ($2 + $21), ($3 + $22), ($4 + $23) + ($6 + $25) + ($8 + $27), ($5 + $24) + ($7 + $26) + ($9 + $28), ($10 + $29) + ($12 + $31) + ($14 + $33) + ($16 + $35) + ($18 + $37), ($11 + $30) + ($13 + $32) + ($15 + $34) + ($17 + $36) + ($19 + $38)}' A.dat >> A_full.dat

Sample files:

A1.dat:

-0.908  0.3718E-03  0.2227E-02  0.1216E-05  0.6719E-05  0.1697E-05  0.1052E-04  0.1697E-05  0.1052E-04  0.5774E-07  0.3360E-06  0.5774E-07  0.3360E-06  0.5418E-06  0.3169E-05  0.1972E-06  0.1099E-05  0.1610E-05  0.9417E-05
-0.902  0.1042E-02  0.3365E-02  0.3427E-05  0.1021E-04  0.4837E-05  0.1619E-04  0.4837E-05  0.1619E-04  0.1623E-06  0.5093E-06  0.1623E-06  0.5093E-06  0.1522E-05  0.4803E-05  0.5530E-06  0.1661E-05  0.4522E-05  0.1427E-04
-0.895  0.1962E-02  0.4677E-02  0.6479E-05  0.1428E-04  0.9232E-05  0.2289E-04  0.9232E-05  0.2289E-04  0.3064E-06  0.7100E-06  0.3064E-06  0.7100E-06  0.2870E-05  0.6694E-05  0.1042E-05  0.2310E-05  0.8530E-05  0.1988E-04
-0.889  0.3067E-02  0.6167E-02  0.1019E-04  0.1893E-04  0.1470E-04  0.3064E-04  0.1470E-04  0.3064E-04  0.4806E-06  0.9388E-06  0.4806E-06  0.9388E-06  0.4500E-05  0.8850E-05  0.1629E-05  0.3047E-05  0.1337E-04  0.2629E-04

A2.dat:

-0.908  0.9081E-04  0.5463E-03  0.9126E-05  0.5564E-04  0.4880E-06  0.3004E-05  0.4880E-06  0.3004E-05  0.2218E-06  0.1311E-05  0.2218E-06  0.1311E-05  0.1433E-06  0.8079E-06  0.1452E-06  0.8808E-06  0.4262E-06  0.2402E-05
-0.902  0.2531E-03  0.8191E-03  0.2580E-04  0.8502E-04  0.1377E-05  0.4565E-05  0.1377E-05  0.4565E-05  0.6264E-06  0.2000E-05  0.6264E-06  0.2000E-05  0.3994E-06  0.1211E-05  0.4063E-06  0.1327E-05  0.1188E-05  0.3599E-05
-0.895  0.4742E-03  0.1130E-02  0.4894E-04  0.1194E-03  0.2604E-05  0.6378E-05  0.2604E-05  0.6378E-05  0.1187E-05  0.2805E-05  0.1187E-05  0.2805E-05  0.7483E-06  0.1670E-05  0.7638E-06  0.1839E-05  0.2225E-05  0.4963E-05
-0.889  0.7357E-03  0.1480E-02  0.7735E-04  0.1591E-03  0.4094E-05  0.8448E-05  0.4094E-05  0.8448E-05  0.1874E-05  0.3729E-05  0.1874E-05  0.3729E-05  0.1161E-05  0.2186E-05  0.1191E-05  0.2419E-05  0.3452E-05  0.6496E-05

A.dat:

-0.908  0.3718E-03  0.2227E-02  0.1216E-05  0.6719E-05  0.1697E-05  0.1052E-04  0.1697E-05  0.1052E-04  0.5774E-07  0.3360E-06  0.5774E-07  0.3360E-06  0.5418E-06  0.3169E-05  0.1972E-06  0.1099E-05  0.1610E-05  0.9417E-05       -0.908  0.9081E-04  0.5463E-03  0.9126E-05  0.5564E-04  0.4880E-06  0.3004E-05  0.4880E-06  0.3004E-05  0.2218E-06  0.1311E-05  0.2218E-06  0.1311E-05  0.1433E-06  0.8079E-06  0.1452E-06  0.8808E-06  0.4262E-06  0.2402E-05
-0.902  0.1042E-02  0.3365E-02  0.3427E-05  0.1021E-04  0.4837E-05  0.1619E-04  0.4837E-05  0.1619E-04  0.1623E-06  0.5093E-06  0.1623E-06  0.5093E-06  0.1522E-05  0.4803E-05  0.5530E-06  0.1661E-05  0.4522E-05  0.1427E-04       -0.902  0.2531E-03  0.8191E-03  0.2580E-04  0.8502E-04  0.1377E-05  0.4565E-05  0.1377E-05  0.4565E-05  0.6264E-06  0.2000E-05  0.6264E-06  0.2000E-05  0.3994E-06  0.1211E-05  0.4063E-06  0.1327E-05  0.1188E-05  0.3599E-05
-0.895  0.1962E-02  0.4677E-02  0.6479E-05  0.1428E-04  0.9232E-05  0.2289E-04  0.9232E-05  0.2289E-04  0.3064E-06  0.7100E-06  0.3064E-06  0.7100E-06  0.2870E-05  0.6694E-05  0.1042E-05  0.2310E-05  0.8530E-05  0.1988E-04       -0.895  0.4742E-03  0.1130E-02  0.4894E-04  0.1194E-03  0.2604E-05  0.6378E-05  0.2604E-05  0.6378E-05  0.1187E-05  0.2805E-05  0.1187E-05  0.2805E-05  0.7483E-06  0.1670E-05  0.7638E-06  0.1839E-05  0.2225E-05  0.4963E-05
-0.889  0.3067E-02  0.6167E-02  0.1019E-04  0.1893E-04  0.1470E-04  0.3064E-04  0.1470E-04  0.3064E-04  0.4806E-06  0.9388E-06  0.4806E-06  0.9388E-06  0.4500E-05  0.8850E-05  0.1629E-05  0.3047E-05  0.1337E-04  0.2629E-04       -0.889  0.7357E-03  0.1480E-02  0.7735E-04  0.1591E-03  0.4094E-05  0.8448E-05  0.4094E-05  0.8448E-05  0.1874E-05  0.3729E-05  0.1874E-05  0.3729E-05  0.1161E-05  0.2186E-05  0.1191E-05  0.2419E-05  0.3452E-05  0.6496E-05

A_full.dat:

-0.908 0.00046261 0.0027733 1.4712e-05 8.9407e-05 3.62278e-06 2.10697e-05
-0.902 0.0012951 0.0041841 4.1655e-05 0.00013674 1.01681e-05 3.18896e-05
-0.895 0.0024362 0.005807 7.9091e-05 0.000192216 1.91659e-05 4.4386e-05
-0.889 0.0038027 0.007647 0.000125128 0.000256206 3.00122e-05 5.86236e-05

More information about the 7 columns of the output file (A_full.dat); a worked check on the first data row follows the list:

  • All of the input A?.dat files have the same values in col 1. A_full.dat must also have the same col 1.
  • col 2 of A_full.dat should be the summation of col 2 of all A?.dat files.
  • col 3 of A_full.dat should be the summation of col 3 of all A?.dat files.
  • col 4 of A_full.dat should be the summation of cols 4, 6, and 8 of all A?.dat files.
  • col 5 of A_full.dat should be the summation of cols 5, 7, and 9 of all A?.dat files.
  • col 6 of A_full.dat should be the summation of cols 10, 12, 14, 16, and 18 of all A?.dat files.
  • col 7 of A_full.dat should be the summation of cols 11, 13, 15, 17, and 19 of all A?.dat files.
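
As a quick check of this mapping against the n=2 sample files: in the first data row, col 2 of A_full.dat is 0.3718E-03 + 0.9081E-04 = 0.00046261, and col 4 is (0.1216E-05 + 0.1697E-05 + 0.1697E-05) + (0.9126E-05 + 0.4880E-06 + 0.4880E-06) = 1.4712e-05, both agreeing with the first row of the sample A_full.dat above.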

At first, I posted this question in a confusing manner, but with the help of @markp-fuso's input, I've edited it to make it easier to comprehend.

Recommended answer

NOTE: Updated based on OP's latest changes (include field $1 in the output), and incorporating EdMorton's suggestion for the awk/for loop.

Based on OP's current awk command ...

awk '{print ($2 + $21 + $40 + $59 + $78 + $97), ($3 + $22 + $41 + $60 + $79 + $98), ($4 + $23 + $42 + $61 + $80 + $99) + ($6 + $25 + $44 + $63 + $82 + $101) + ($8 + $27 + $46 + $65 + $84 + $103), ($5 + $24 + $43 + $62 + $81 + $100) + ($7 + $26 + $45 + $64 + $83 + $102) + ($9 + $28 + $47 + $66 + $85 + $104), ($10 + $29 + $48 + $67 + $86 + $105) + ($12 + $31 + $50 + $69 + $88 + $107) + ($14 + $33 + $52 + $71 + $90 + $109) + ($16 + $35 + $54 + $73 + $92 + $111) + ($18 + $37 + $56 + $75 + $94 + $113), ($11 + $30 + $49 + $68 + $87 + $106) + ($13 + $32 + $51 + $70 + $89 + $108) + ($15 + $34 + $53 + $72 + $91 + $110) + ($17 + $36 + $55 + $74 + $93 + $112) + ($19 + $38 + $57 + $76 + $95 + $114)}' A.dat >> A_full.dat

... as well as an assortment of comments and edits, I come away with the following:

  • all input files have 19 fields
  • all input files have the same number of rows
  • unsure what, if anything, is to be done with field #1 (due to question edits and a confusing explanation)
  • the desired output consists of 7 columns (col1 to col7) for each set of input rows
  • col1 : copy of field #1 from the first file (field #1 should be the same in all input files)
  • col2 : summation of field #2 from all input files
  • col3 : (negated) summation of field #3 from all input files
  • col4 : summation of fields #4, #6 and #8 from all input files
  • col5 : (negated) summation of fields #5, #7 and #9 from all input files
  • col6 : summation of fields #10, #12, #14, #16 and #18 from all input files
  • col7 : summation of fields #11, #13, #15, #17 and #19 from all input files
  • for now I'm assuming we want the output rows in the same order in which they're read from the input files (ie, input NR == output NR)
  • OP needs a solution that can work with n input files

Instead of paste(ing) the n input files into a single big file (A.dat) and then having awk parse the n x 19 columns, I propose having awk read the individual data files (A?.dat) and accumulate the desired data values 'on the fly'.

One awk solution:

awk '
FNR==NR { col1[FNR]=$1 }                        # first file only: remember field #1 for each row
        { col2[FNR]+=($2)                       # every file: accumulate the per-row sums
          col3[FNR]-=($3)                       # negated
          col4[FNR]+=($4 + $6 + $8)
          col5[FNR]-=($5 + $7 + $9)             # negated
          col6[FNR]+=($10 + $12 + $14 + $16 + $18)
          col7[FNR]+=($11 + $13 + $15 + $17 + $19)
        }
END     { for ( i=1 ; i <= FNR ; i++ )          # FNR in END == row count of the last file read
              printf "%s %7.5f %7.5f %8.6f %8.6f %d %d\n", col1[i], col2[i], col3[i], col4[i], col5[i], col6[i], col7[i]
        }
' A1.dat A2.dat A3.dat ... An.dat
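
Rather than listing A1.dat through An.dat by hand, the shell can expand a glob into awk's argument list. A minimal sketch of the invocation (not from the original answer; sum_cols.awk is a hypothetical file holding the program above, and the A[0-9]*.dat pattern is an assumption about the naming scheme):

# let the shell supply every input file; sum_cols.awk is a hypothetical file containing the program above
awk -f sum_cols.awk A[0-9]*.dat > A_full.dat

Note that the glob expands in lexical order (A10.dat sorts before A2.dat), but since every file carries the same col 1 and the other columns are accumulated by commutative addition, the file order does not affect the output.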

NOTE: The printf formats are based on the limited sample output provided by the OP; they may need adjusting based on the desired results from a larger data set.

NOTE: One downside to this awk solution is that we have to store all of the (output) data in a set of arrays, which in turn could lead to memory usage issues if we're dealing with a large volume of rows.
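
If the row count is large enough for that to matter, an alternative (not part of the original answer) is to keep the OP's paste pipeline and generalize the arithmetic with a loop over 19-column blocks, so each row is processed and printed immediately with no per-row arrays. A minimal sketch, assuming every input file contributes exactly 19 columns and mirroring the negated col3/col5 of the solution above:

paste A?.dat |
awk '{ c2=c3=c4=c5=c6=c7=0
       for (off=0; off+19<=NF; off+=19) {            # one 19-column block per input file
           c2 += $(off+2)
           c3 -= $(off+3)                            # negated, as in the solution above
           c4 += $(off+4) + $(off+6) + $(off+8)
           c5 -= $(off+5) + $(off+7) + $(off+9)      # negated, as in the solution above
           for (i=10; i<=18; i+=2) c6 += $(off+i)    # even fields 10-18
           for (i=11; i<=19; i+=2) c7 += $(off+i)    # odd fields 11-19
       }
       print $1, c2, c3, c4, c5, c6, c7              # default number formatting, unlike the printf above
     }' > A_full.dat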

Parsing the OP's sample input file (A.dat) back out into the first two original data files (note: these values come from an earlier revision of the question, which is why they differ from the samples above):

$ cat A1.dat
  4.429  0.3620E-01  0.3919E-01  0.1063E-01  0.9525E-02  0.9146E-02  0.7986E-02  0.9146E-02  0.7986E-02  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
  4.436  0.3489E-01  0.3876E-01  0.1022E-01  0.9461E-02  0.8803E-02  0.7872E-02  0.8803E-02  0.7872E-02  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
  4.442  0.3364E-01  0.3852E-01  0.9760E-02  0.9469E-02  0.8402E-02  0.7801E-02  0.8402E-02  0.7801E-02  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
  4.449  0.3260E-01  0.3917E-01  0.9364E-02  0.9753E-02  0.8040E-02  0.8083E-02  0.8040E-02  0.8083E-02  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00

$ cat A2.dat
   4.429  0.4333E-01  0.3393E-01  0.6788E-02  0.6654E-02  0.8228E-02  0.7242E-02  0.8228E-02  0.7242E-02  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
   4.436  0.4101E-01  0.3372E-01  0.6687E-02  0.6563E-02  0.7849E-02  0.7179E-02  0.7849E-02  0.7179E-02  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
   4.442  0.3861E-01  0.3437E-01  0.6561E-02  0.6437E-02  0.7440E-02  0.7192E-02  0.7440E-02  0.7192E-02  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
   4.449  0.3646E-01  0.3667E-01  0.6462E-02  0.6514E-02  0.7091E-02  0.7443E-02  0.7091E-02  0.7443E-02  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00

Running the proposed awk solution against these two input files generates:

$ awk '{ col1[FNR]+= .... }' A1.dat A2.dat
4.429 0.07953 -0.07312 0.052166 -0.046635 0 0
4.436 0.07590 -0.07248 0.050211 -0.046126 0 0
4.442 0.07225 -0.07289 0.048005 -0.045892 0 0
4.449 0.06906 -0.07584 0.046088 -0.047319 0 0
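
As a cross-check against the two input files: in the first output row, col2 = 0.3620E-01 + 0.4333E-01 = 0.07953, and col3 = -(0.3919E-01 + 0.3393E-01) = -0.07312, confirming the (negated) col3 behavior described above.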

