Biostrings gregexpr2 gives an error while gregexpr works fine

Problem description

I'm replacing gregexpr with gregexpr2 to detect overlapping matches. When I try:

>subSeq
 3000-letter "DNAString" instance
 seq: ACACGTGTTCTATTTTCATTTGCTGACATTTTCTAGTGCATCATTTTTTATTTTATTTTCATT....

gregexpr2("TAAT|ATTA",subSeq)

Error in matches[[i]] : subscript out of bounds

whereas

gregexpr("TAAT|ATTA",subSeq)

works fine.

What is going on?

Recommended answer

It is quite clear if you read the gregexpr2 documentation:

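This is a replacement for the standard gregexpr function that does exact matching only. Standard gregexpr() misses matches when they are overlapping. The gregexpr2 function finds all matches but it only works in "fixed" mode i.e. for exact matching (regular expressions are not supported).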

The relevant part is the last sentence: gregexpr2 only works in "fixed" mode. So your gregexpr2 call searches for the literal text TAAT|ATTA in your input, and since the sequence contains no pipe character, no match is found.
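
To make the difference between regex and fixed matching concrete, here is a minimal sketch in base R. It uses gregexpr with fixed = TRUE as a stand-in for fixed-mode matching (it is not gregexpr2 itself), and the short string x is an illustrative assumption, not the 3000-letter sequence from the question.

```r
# Illustrative input (an assumption); not the DNAString from the question.
x <- "TAATTA"

# Regex mode: "|" means alternation, but matches may not overlap,
# so the ATTA starting at position 3 is skipped.
regex_hits <- gregexpr("TAAT|ATTA", x)[[1]]

# Fixed mode: the pattern is the literal 9-character text "TAAT|ATTA",
# which does not occur in x, so the result is -1 (no match).
fixed_hits <- gregexpr("TAAT|ATTA", x, fixed = TRUE)[[1]]

print(as.integer(regex_hits))  # 1
print(as.integer(fixed_hits))  # -1
```

Neither call reports both TAAT and ATTA here, which is why a different approach is needed for overlapping regex matches.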

If you need regex overlapping matches, use str_match_all from stringr:

> library(stringr)
> x <- "TAATTA"
> str_match_all(x, "(?=(TAAT|ATTA))")
[[1]]
     [,1] [,2]
[1,] ""   "TAAT"
[2,] ""   "ATTA"

The str_match_all function keeps all the capturing group values (matched with (...) pattern parts), so you will collect all the overlapping matches due to the capturing group used inside a positive lookahead (that is a non-consuming pattern letting the regex engine fire the pattern at each location inside the string).

Pattern details:

- (?= - start of a non-consuming positive lookahead that will trigger at each location inside the string
- ( - start of a capturing group
- TAAT - TAAT substring
- | - or
- ATTA - ATTA substring
- ) - end of the capturing group
- ) - end of the lookahead
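
If you also need the start positions of the overlapping matches, the same lookahead pattern can be used with base R. This is a minimal sketch, assuming a plain character string as input (for a DNAString you would presumably convert it first, for example with as.character(subSeq)); the variable names are illustrative.

```r
# Illustrative input (an assumption); replace with your own sequence string.
x <- "TAATTA"

# perl = TRUE enables lookaheads. Each match is zero-length, but the
# capturing group inside the lookahead records the overlapping hit.
m <- gregexpr("(?=(TAAT|ATTA))", x, perl = TRUE)[[1]]

# Start position of every location where the lookahead succeeded.
starts <- as.integer(m)

# Recover the captured text from the capture.start / capture.length attributes.
cap_start <- as.vector(attr(m, "capture.start")[, 1])
cap_len   <- as.vector(attr(m, "capture.length")[, 1])
hits <- substring(x, cap_start, cap_start + cap_len - 1)

print(starts)  # 1 3
print(hits)    # "TAAT" "ATTA"
```

The positions come from the zero-length lookahead matches, while the matched text comes from the capturing group, mirroring what str_match_all returns in its second column.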