Problem description
I have to process a large input file (2.9 GB) and produce output in a specific format (described below).
A sample of the input file:
GS RSPH14
CC Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)
FT TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR029934262; C/EBPBETA (A-549); https://www.encodeproject.org/experiments/ENCSR000DYI/; 23150853..23151108
GS CLXC15
CC Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582

Description: Every line in the input file starts with GS, CC, or FT. I want to ignore the GS lines. For each CC line, I want to split it on : and take index 1 (0-based counting); in my sample that is chr22 (line 2) and chr3 (line 7). For each FT line, I want to split it on ; and take index 1 and the last index (for line 3 of my sample, SP1 (Jurkat) and 23144712..23145380, respectively), then process them so that my output file looks like this:
chr22 23144712 23145380 SP1
chr22 23147445 23148194 ER-ALPHA
chr22 23150853 23151108 C/EBPBETA
chr3 23147445 23148194 ER-ALPHA
chr3 23151393 23151582 CTCF

Any help will be much appreciated!
My Try: I am able to split the file on ; so that I get my desired columns. What I tried is: awk -F'[;]' '{print $2 "\t" $4}' sample.txt > output.txt. This gives me output as:
hg38): chr22:23141092..23152092 (REVERSE)
SP1 (Jurkat) 23144712..23145380
ER-ALPHA (MCF-7) 23147445..23148194
C/EBPBETA (A-549) 23150853..23151108
hg38): chr3:23144021..23155021 (REVERSE)
ER-ALPHA (MCF-7) 23147445..23148194
CTCF (MCF-7) 23151393..23151582

Now, from the lines that came from CC records (the ones starting with hg38):) I only want chr22 and chr3. From the other lines I only want the last column, with the corresponding chr prepended. The first column of those other lines should also be split on ( so that only the part before it is kept.
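The splitting rules described in the question can be tried on single lines before touching the full file. This is only a sketch: the two sample lines are copied from the question, while the variable names (cc, ft, chrom, tf, range) are made up for illustration.

```shell
# Two sample lines copied from the question.
cc='CC Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)'
ft='FT TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380'

# CC line: with ":" as separator, awk field $2 is "index 1" in 0-based terms.
chrom=$(printf '%s\n' "$cc" | awk -F':' '{ gsub(/ /, "", $2); print $2 }')

# FT line: with ";" as separator, $2 holds the factor name plus cell line.
# Splitting that on "(" and trimming blanks keeps only the name.
tf=$(printf '%s\n' "$ft" | awk -F';' '{ split($2, p, /[(]/); gsub(/^ +| +$/, "", p[1]); print p[1] }')

# FT line: the last field ($NF) holds the start..end range.
range=$(printf '%s\n' "$ft" | awk -F';' '{ gsub(/ /, "", $NF); print $NF }')

echo "$chrom $tf $range"
```

What the single-separator attempt cannot do on its own is carry the chr value from a CC line over to the FT lines that follow it, which is why a stateful script is needed.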
Solution
Using awk:
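awk '
# CC line: remember the chromosome (second piece when split on ":")
$1 == "CC" { split($0, a, /:/); key = a[2] }
# FT line: pick the factor name and the start..end range
$1 == "FT" {
    n = split($0, a, /;/)
    split(a[2], b, FS)           # b[1] is the name before the cell line
    split(a[n], c, /[.]{2}/)     # c[1] = start, c[2] = end
    print key, c[1], c[2], b[1]
}
' file | column -t

chr22  23144712  23145380  SP1
chr22  23147445  23148194  ER-ALPHA
chr22  23150853  23151108  C/EBPBETA
chr3   23147445  23148194  ER-ALPHA
chr3   23151393  23151582  CTCF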
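A possible variant for the full 2.9 GB file, offered as a sketch rather than a replacement for the answer above. As far as I know, column -t has to read all of its input before it can size the columns, so this version has awk emit tab-separated fields directly and skips column. It also writes the two-dot separator as [.][.], since interval expressions such as {2} are not enabled by default in some older awk builds. The function name tfbs_to_tsv is made up for illustration.

```shell
# Hypothetical helper: prints chromosome, start, end, factor as tab-separated fields.
tfbs_to_tsv() {
    awk '
    BEGIN { OFS = "\t" }
    # CC line: remember the chromosome for the FT lines that follow.
    $1 == "CC" {
        split($0, a, /:/)
        key = a[2]
        gsub(/ /, "", key)           # drop the blank after "hg38):"
    }
    # FT line: emit one row using the remembered chromosome.
    $1 == "FT" {
        n = split($0, a, /;/)
        split(a[2], b, " ")          # b[1] is the factor name
        split(a[n], c, /[.][.]/)     # same split as /[.]{2}/, no interval needed
        gsub(/ /, "", c[1])          # drop the blank before the start coordinate
        print key, c[1], c[2], b[1]
    }
    ' "$@"
}
```

Called as tfbs_to_tsv sample.txt > output.txt (the file names from the question), this should give the same five rows as above, separated by tabs instead of aligned spaces.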