grep issue when using two files - I tried everything

Problem description
I have two files (recode and reads) that were built and saved with nano, and I want to compare the contents of recode against reads and extract the lines in reads that overlap. I have been trying to write a while loop with that logic in mind, but without success so far. The output does not match the pattern specified in the while loop that uses grep and recode. The script was supposed to read each line of recode.txt, compare it to reads.fastq, extract each matching line plus one line before and 2 after from reads.fastq, and save the output in a different file for each line of recode.txt (all matches for that line combined).
Here are the tables and code:

File recode.txt:

    GTGTCTTA+ATCACGAC
    GTGTCTTA+ACAGTGGT
    GTGTCTTA+CAGATCCA
    GTGTCTTA+ACAAACGG
    GTGTCTTA+ACCCAGCA
    GTGTCTTA+AACCCCTC
    GTGTCTTA+CCCAACCT
    ATCACGAC+AAGGTTCA
    GTGTCTTA+GAAACCCA

File reads.fastq:

    ###################################
    @NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
    GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
    +
    AAAAA#EAAEEEEE6EAEAEEEEEEEEEEEEEEEAEEEEEE/EEEEEEEEEE/EEEEEEEEEEEEEEEAEEEEEA
    @NB500931:113:HW53WBGX2:1:11101:6116:1049 1:N:0:ACAAACGG+AAGGTTCA
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +
    ###################################
    @NB500931:113:HW53WBGX2:1:11101:6885:1049 1:N:0:ACCCAGCA+ACTTAGCA
    GAGGGNGCTGTCCCAGTAATTGGGTTCAGATGACATTTGCTTGATTTTAGGGATGTACGAGATTTTCGTGGATC
    +
    AAA/A#EAEEEEEAEAEEA///EEAEEEEE///AEEAEE/AA//EAA<EEE/E//AEEEAAA//E/A<6//EEA
    @NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
    CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
    +

Script:

    #!/bin/bash
    #PBS -l nodes=1:ppn=8,walltime=96:00:00
    while read line
    do
    echo "working on $line"
    grep -A3 "$line" reads.fastq | grep -v "^--$" >> "$line"_sorted.fastq
    done<recode.txt

Both files are in UNIX format, and the following command (without the loop) works smoothly:

    grep -A3 "ATCACGAC+AAGGTTCA" reads.fastq | grep -v "^--$" > sorted_file.fastq

According to that command, my output should be:

    @NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
    GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
    +
    @NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
    CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
    +

However, when I use the while loop I get an empty file with the correct name.
Can you please help me?

UPDATE: I have tried dos2unix to convert my files and it didn't work.

UPDATE: I edited the question to include my expected output.

Solution

Without seeing the expected output it's a guess, but it sounds like this is what you're trying to do:

    $ awk -F: 'NR==FNR{a[$0];next} $NF in a{c=3} c&&c--' recode.txt reads.fastq
    @NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
    CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
    +

No shell loop is required (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice for SOME of the reasons why that matters). The script just saves the values from recode.txt as array indices. Then, when reading reads.fastq, if the last :-separated field is an index of the array (i.e. it existed in recode.txt), it sets a counter to 3 and prints every line while the counter is greater than zero, decrementing the counter each time (see printing-with-sed-or-awk-a-line-following-a-matching-pattern for other examples of printing text starting from a match).

To save each found record in a file named after the string in that final field, as it looks like you might be trying to do in your shell loop, would be:

    awk -F: '
        NR==FNR { a[$0]; next }
        $NF in a { c=3; close(out); out=$NF"_sorted.fastq" }
        c&&c-- { print >> out }
    ' recode.txt reads.fastq

Note that this reads reads.fastq just once in total, not once per line of recode.txt as your shell loop was doing, so you can expect a vast performance improvement from that aspect alone.

Finally, if recode.txt is just a list of ALL of the final fields that exist in reads.fastq, then you simply don't need it. This is all you need to split reads.fastq into separate files of 3 lines per record, each named after the value following the last : on each line that starts with @:

    awk -F: '
        /^@/ { c=3; close(out); out=$NF"_sorted.fastq" }
        c&&c-- { print >> out }
    ' reads.fastq
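For anyone who wants to watch the counter idiom work before running it on real data, here is a minimal sketch on made-up input. The barcodes, read names, temporary directory, and the variable names `wanted`, `left`, and `result` are all hypothetical and chosen only for illustration; the awk logic is the same two-file pattern as in the answer, just spelled out with comments.

```shell
#!/bin/sh
# Sketch only: tiny, invented inputs in a throwaway directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical barcode list, one wanted barcode pair per line.
printf '%s\n' 'AAAA+CCCC' 'GGGG+TTTT' > recode.txt

# Hypothetical 4-line FASTQ records; only the headers of r1 and r3
# end in a barcode pair that is listed in recode.txt.
cat > reads.fastq <<'EOF'
@r1:1 1:N:0:AAAA+CCCC
ACGT
+
EEEE
@r2:1 1:N:0:AAAA+GGGG
TTTT
+
####
@r3:1 1:N:0:GGGG+TTTT
CCCC
+
AAAA
EOF

result=$(awk -F: '
    # First file (NR == FNR): remember each barcode as an array index.
    NR == FNR { wanted[$0]; next }
    # Second file: a line whose last ":"-separated field was listed
    # starts a countdown covering 3 lines (header, sequence, "+").
    $NF in wanted { left = 3 }
    # Print while the countdown is non-zero, decrementing each time.
    left && left--
' recode.txt reads.fastq)

printf '%s\n' "$result"
```

On this sample the script prints the header, sequence, and `+` line of r1 and r3 and skips r2 entirely, because the last field of r2's header is not in recode.txt. Changing `left = 3` to `left = 4` would also keep the quality line of each record.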