Overlapping matches in R

Problem description

I searched and found this forum discussion on achieving the effect of overlapping matches.

I also found the following SO question, which talks about finding indexes to perform this task, but I could not find anything concise about grabbing overlapping matches in the R language.

In almost any language that supports PCRE, I can do this with a positive lookahead assertion that contains a capturing group, so that the group captures the overlapping matches.

But when I do this in R the same way I would in other languages, using perl=T, no results are returned.

> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""

The same goes for both the stringi and stringr packages.

> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""

The correct results that should be returned are:

[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Edit

  1. I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches, and why are no results returned? I am looking for a somewhat detailed answer.

Are the stringi and stringr packages not capable of doing this either, as an alternative to regmatches?

Please feel free to add to my answer or come up with a different workaround than the one I have found.

Recommended answer

The standard regmatches does not work well with captured matches (specifically, multiple captured matches in the same string). In this case, since you are "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- function that may illustrate this. Observe:

x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"

Notice how all the letters are preserved; we have just replaced the locations of the zero-length matches with something we can observe.
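To see the same thing without modifying x, the match object returned by gregexpr can be inspected directly. The following is a minimal sketch using only base R; it relies on the match.length attribute and, because perl = TRUE is set and the pattern has a capture group, on the capture.length attribute. The variable names are only for illustration.

```r
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

# Positions where the lookahead succeeds
match_pos <- as.integer(m)

# Length of each overall match: a lookahead consumes no characters,
# so regmatches() has nothing but empty strings to extract
match_len <- attr(m, 'match.length')

# Length of capture group 1 at each of those positions
capture_len <- attr(m, 'capture.length')[, 1]

print(match_pos)
print(match_len)
print(capture_len)
```

The overall matches are empty, while the capture group at each position is not, which is why regmatches returns seven empty strings.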

I've created a regcapturedmatches() function that I often use for such tasks. For example:

x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

The gregexpr call is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
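As one possible way of pulling the data out by hand, the sketch below reads the capture.start and capture.length attributes that gregexpr attaches when perl = TRUE and passes them to substring. It assumes a single capture group and at least one match, and it is not the implementation of regcapturedmatches(); the variable names are illustrative.

```r
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

# Start and length of capture group 1 for every match
# (each attribute is a matrix with one column per capture group)
starts <- attr(m, 'capture.start')[, 1]
lens   <- attr(m, 'capture.length')[, 1]

# Cut the captured text out of the original string
overlaps <- substring(x, starts, starts + lens - 1)
print(overlaps)
```

For this input, the result is the vector of seven overlapping matches listed in the question. When nothing matches, gregexpr reports -1 for the start, so a more careful version would check for that before calling substring.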
