Problem description
How can I write a function that accepts a DNA sequence (as a single string) and a number n >= 2, and returns a vector of all DNA subsequences (as strings) that start with the triplet "AAA" or "GAA", end with the triplet "AGT", and have at least 2 and at most n other triplets between the start and the end?
Example 1:
for "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT" and n = 2,
the answer is c("GAACCCACTAGT", "AAATTTGGGAGT").
Example 2:
for the same sequence and n = 10,
the answer is c("GAACCCACTAGTATAAAATTTGGGAGT", "AAACCCTTTGGGAGT").
Recommended answer
Here is one possible approach.
It uses a regex whose core is 2 to n repetitions of three [A-Z] characters, placed between the allowed start triplets and the end triplet.
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
#for n = 10, this looks like: "(AAA|GAA)([A-Z]{3}){2,10}AGT"
stringr::str_extract_all( dna, regex )
# n = 2
# [[1]]
# [1] "GAACCCACTAGT" "AAATTTGGGAGT"
# n = 10
# [[1]]
# [1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
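Since the question asks for a function, the script above could be wrapped roughly as follows. This is a minimal sketch, not the answerer's code: the name `find_dna_subsequences` and its `start`/`end` defaults are hypothetical, and it swaps `stringr::str_extract_all` for base R's `gregexpr()` and `regmatches()` so that it runs without extra packages.

```r
# Hypothetical wrapper around the regex approach shown above (base R only).
find_dna_subsequences <- function(dna, n, start = c("AAA", "GAA"), end = "AGT") {
  # The question specifies a single string and n >= 2
  stopifnot(is.character(dna), length(dna) == 1, n >= 2)
  # Build e.g. "(AAA|GAA)([A-Z]{3}){2,10}AGT"
  pattern <- paste0(
    "(", paste(start, collapse = "|"), ")",
    "([A-Z]{3}){2,", sprintf("%d", as.integer(n)), "}",
    end
  )
  # gregexpr() locates all non-overlapping matches; regmatches() extracts them
  regmatches(dna, gregexpr(pattern, dna, perl = TRUE))[[1]]
}

dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
find_dna_subsequences(dna, 2)
find_dna_subsequences(dna, 10)
```

As with the original regex, matching is greedy and non-overlapping, and nothing forces a match to start on a triplet boundary of the input, so a sequence with overlapping candidates may need a different strategy.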