Problem description
I'm replacing gregexpr with gregexpr2 to detect overlapping matches. When I try:
>subSeq
3000-letter "DNAString" instance
seq: ACACGTGTTCTATTTTCATTTGCTGACATTTTCTAGTGCATCATTTTTTATTTTATTTTCATT....
gregexpr2("TAAT|ATTA",subSeq)
Error in matches[[i]] : subscript out of bounds
whereas
gregexpr("TAAT|ATTA",subSeq)
works fine.
What is going on?
Recommended answer
It is quite clear if you read the gregexpr2 documentation:
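This is a replacement for the standard gregexpr function that does exact matching only. Standard gregexpr() misses matches when they are overlapping. The gregexpr2 function finds all matches but it only works in "fixed" mode, i.e. for exact matching (regular expressions are not supported).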
The last sentence of that passage is the relevant one. Your gregexpr2 call searches for the literal text TAAT|ATTA in your input, and since the sequence contains no pipe character, no match is found.
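The difference between regex and literal matching can be reproduced with base R alone. The sketch below uses gregexpr with fixed = TRUE as a stand-in for the "fixed" mode described in the documentation; it is not gregexpr2 itself, and the variable names are only illustrative.

```r
x <- "TAATTA"

# Regex mode: "|" is alternation. Only the first hit is reported,
# because the overlapping ATTA starts inside the consumed TAAT.
regex_hits <- as.integer(gregexpr("TAAT|ATTA", x)[[1]])

# Fixed mode: the pattern is the literal 9-character text "TAAT|ATTA",
# which does not occur in x, so gregexpr reports -1 (no match).
fixed_hits <- as.integer(gregexpr("TAAT|ATTA", x, fixed = TRUE)[[1]])

print(regex_hits)
print(fixed_hits)
```

Here regex_hits is 1 (one non-overlapping match) and fixed_hits is -1, which shows both problems at once: plain gregexpr misses the overlap, and a fixed search never sees the alternation.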
If you need overlapping regex matches, use str_match_all from stringr:
> library(stringr)
> x <- "TAATTA"
> str_match_all(x, "(?=(TAAT|ATTA))")
[[1]]
     [,1] [,2]
[1,] ""   "TAAT"
[2,] ""   "ATTA"
The str_match_all function keeps all the capturing group values (the parts matched by (...) in the pattern). Because the capturing group sits inside a positive lookahead, which is a non-consuming pattern that lets the regex engine try the pattern at every position in the string, you collect all the overlapping matches in the second column. The first column is empty because the lookahead itself matches zero characters.
Pattern details:
(?= - start of a non-consuming positive lookahead that will trigger at each location inside the string
( - start of a capturing group
TAAT - the TAAT substring
| - or
ATTA - the ATTA substring
) - end of the capturing group
) - end of the lookahead
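If the match positions are needed as well, or stringr is not available, the same lookahead pattern can be used with base R. This is a minimal sketch, assuming perl = TRUE regexes and the capture.start / capture.length attributes that gregexpr attaches in that mode. The variable names are illustrative, and for a DNAString such as subSeq the input would presumably first be converted with as.character(subSeq), which is not run here.

```r
x <- "TAATTA"

# The lookahead matches zero characters at each position where
# TAAT or ATTA begins; the inner group records what was seen.
m <- gregexpr("(?=(TAAT|ATTA))", x, perl = TRUE)[[1]]

# With perl = TRUE, gregexpr stores capture-group positions and
# lengths as matrices (one row per match, one column per group).
starts <- as.integer(attr(m, "capture.start")[, 1])
lens   <- as.integer(attr(m, "capture.length")[, 1])

# Extract the overlapping substrings from their start positions.
hits <- substring(x, starts, starts + lens - 1)

print(data.frame(start = starts, match = hits))
```

For "TAATTA" this should report TAAT at position 1 and ATTA at position 3, the same two overlapping matches that str_match_all returns.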