This article explains how to use awk to split rows into separate files based on $2 and $17, and to average $17 within each group.
Problem description
Here is our input file, input.csv:
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
We would like to split this input.csv into two files:
If rows share the same $2 and the max minus min of $17 is <= 1, average $17 and put the result into file "a".
If rows share the same $2 and the max minus min of $17 is > 1, average $17 and put the result into file "b".
Note: if a $2 value occurs only once, we would like to keep that row as well (cpd-6666666, for example, ends up in file a).
Note: for cpd-1111, ($17 max-min) = -1-(-1.3) = 0.3 < 1
a: where ($17 max-min) <= 1. The new $17 for cpd-1111 ($2) is the average of (-1, -1.1, -1.2, -1.3) = -1.15
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.15,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
b: where ($17 max-min) > 1. The new $17 for cpd-7788990 ($2) is the average of (-1, -2, -3) = -2
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
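The range and average quoted for the two multi-row groups can be checked by hand. The following is a minimal sketch, not part of the original question: range_mean is a hypothetical helper that takes the $17 values of one group as arguments and prints their max minus min and their mean.

```sh
# range_mean is a hypothetical helper (not from the original post):
# it prints max-min and the mean of the numbers passed as arguments.
range_mean() {
    printf '%s\n' "$@" | awk '
        NR == 1 { max = min = $1 + 0 }      # seed max/min with the first value
        {
            v = $1 + 0                      # force numeric comparison
            if (v > max) max = v
            if (v < min) min = v
            sum += v
        }
        END { printf "range=%g mean=%g\n", max - min, sum / NR }
    '
}

range_mean -1 -1.1 -1.2 -1.3   # the $17 values of cpd-1111
range_mean -1 -2 -3            # the $17 values of cpd-7788990
```

A group whose range is at most 1 belongs in file a, and any other group belongs in file b.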
Here is my attempt, which separates the input into a and b but does not compute the average yet.
#!/usr/bin/awk -f
BEGIN { FS=","; f1="a"; f2="b" }
FNR==1 { print $0 > f1; print $0 > f2; next }
$2!=last_id && FNR > 2 { handleBlock() }
{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }
END { handleBlock() }
function handleBlock() {
    if( m[1]-m[cnt]<=1 ) fname = f1
    else fname = f2
    for( i=1;i<=cnt;i++ ) { print a[i] > fname }
    cnt=0
}
Is there any way to compute the average in a and b? Thanks.
Solution
You can get the averages in the output files by altering handleBlock() as follows:
function handleBlock() {
    # m[1] is the max and m[cnt] the min only because the rows of each
    # group are sorted by $17 in descending order, as in the sample input
    if( m[1]-m[cnt]<=1 ) fname = f1
    else fname = f2
    # compute the sum of the $17 fields for the group
    for( i=1;i<=cnt;i++ ) { sum+=m[i] }
    # compute the average
    avg = cnt > 0 ? sum/cnt : sum
    # use the first line of the group (the max line) for the output,
    # split on FS into an output array: oarr
    fcnt = split( a[1], oarr )
    # modify the 17th field of the output array
    oarr[17]=avg
    # write the updated array to the desired file one field at a time
    for( i=1;i<=fcnt;i++ ) {
        printf( "%s%s", oarr[i], i==fcnt ? "\n" : FS ) > fname
    }
    cnt=0; sum=0
}
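The handleBlock() above takes m[1] as the max and m[cnt] as the min, which holds for the sample only because each group is already sorted by $17. As a hedged alternative sketch (not the answerer's code), the script below tracks a running max, min and sum per group instead, so the row order inside a group should not matter. The names split_avg.awk and emit_group, and the scratch directory, are assumptions made for illustration. Like the original, it still expects rows with the same $2 to be adjacent.

```sh
# Work in a scratch directory so the output files "a" and "b" clobber nothing.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input copied from the question.
cat > input.csv <<'EOF'
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
EOF

# split_avg.awk is a hypothetical file name for this sketch.
cat > split_avg.awk <<'EOF'
BEGIN { FS = ","; f1 = "a"; f2 = "b" }

# Copy the header line to both output files.
FNR == 1 { print > f1; print > f2; next }

# A new $2 value closes the previous group.
$2 != last_id && cnt > 0 { emit_group() }

{
    v = $17 + 0                                   # force numeric value
    if (cnt == 0) { first = $0; max = min = v }   # first row of a group
    if (v > max) max = v
    if (v < min) min = v
    sum += v; cnt++; last_id = $2
}

END { if (cnt > 0) emit_group() }

# Write the first row of the group with $17 replaced by the group mean.
function emit_group(   n, f, i) {
    fname = (max - min <= 1) ? f1 : f2
    n = split(first, f, FS)
    f[17] = sum / cnt
    for (i = 1; i <= n; i++)
        printf "%s%s", f[i], (i == n ? "\n" : FS) > fname
    cnt = 0; sum = 0
}
EOF

awk -f split_avg.awk input.csv
cat a b
```

As in the answer above, each group is written as its first row with only $17 replaced, so the other columns of the remaining rows in the group are discarded.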