Finding sequences using R

Problem description

How can I write a function that accepts a DNA sequence (as a single string) and a number n >= 2, and returns a vector of all DNA subsequences (as strings) that start with the triplet "AAA" or "GAA", end with the triplet "AGT", and have at least 2 and at most n other triplets between the start and the end?

Q1:

For "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT" and n = 2,
the answer is c("GAACCCACTAGT", "AAATTTGGGAGT").

Q2:

For the same sequence and, e.g., n = 10,
the answer is c("GAACCCACTAGTATAAAATTTGGGAGT", "AAACCCTTTGGGAGT").

Recommended answer

Here is one possible approach.

It uses a regex whose core is 2 to n repetitions of three [A-Z] characters, placed between the allowed start triplets and the end triplet.

library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10  # << set as desired

#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
#for n = 10, this looks like: "(AAA|GAA)([A-Z]{3}){2,10}AGT"

stringr::str_extract_all( dna, regex )

# n = 2
# [[1]]
# [1] "GAACCCACTAGT" "AAATTTGGGAGT"

# n = 10
# [[1]]
# [1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
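
Since the question asks for a function, the same regex can be wrapped in one. The following is only a minimal sketch: the name find_subsequences, its default arguments, and the input checks are illustrative assumptions rather than part of the answer above, and it uses base R (gregexpr and regmatches) instead of stringr so that it runs without extra packages. Matches are greedy and non-overlapping, so a shorter candidate contained in a longer match (for example "GAACCCACTAGT" when n = 10) is not reported separately, which agrees with the expected answers in the question.

```r
# Hypothetical helper: the name, defaults, and checks are illustrative assumptions.
find_subsequences <- function(dna, n, start = c("AAA", "GAA"), end = "AGT") {
  # The question specifies a single string and a number n >= 2.
  stopifnot(is.character(dna), length(dna) == 1,
            is.numeric(n), length(n) == 1, n >= 2)

  # Same pattern as above, e.g. "(AAA|GAA)([A-Z]{3}){2,10}AGT" for n = 10.
  # sprintf("%d", ...) avoids scientific notation for large n.
  pattern <- paste0("(", paste(start, collapse = "|"), ")",
                    sprintf("([A-Z]{3}){2,%d}", as.integer(n)),
                    end)

  # gregexpr() locates non-overlapping matches from left to right;
  # regmatches() returns them as a list with one element per input string.
  regmatches(dna, gregexpr(pattern, dna, perl = TRUE))[[1]]
}

dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
find_subsequences(dna, 2)
# [1] "GAACCCACTAGT" "AAATTTGGGAGT"
find_subsequences(dna, 10)
# [1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
```

If no subsequence matches, the function returns an empty character vector.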
