Problem description
Assume I have text strings that look something like this:
A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3
Here I want to identify sequences of markers (A is a marker, I3 is a marker, etc.) that lead up to a subsequence consisting only of IX markers (i.e. I1, I2, or I3) that contains an I3. This subsequence can have a length of 1 (i.e. be a single I3 marker) or it can be of unlimited length, but it always needs to contain at least one I3 marker, and it can only contain IX markers. The sequence that leads up to the IX subsequence can include I1 and I2, but never I3.
In the string above I need to identify:
A-B-C-I1-I2-D-E-F
which leads up to the I1-I3 subsequence that contains I3, and
D-D-D-D
which leads up to the I1-I1-I2-I1-I1-I3-I3 subsequence that contains at least one I3.
Here are a few additional examples:
A-B-I3-C-I3
From this string we should identify A-B, because it is followed by a subsequence of length 1 that contains I3, and also C, because it is followed by a subsequence of length 1 that contains I3.
And:
I3-A-I3
Here A should be identified, because it is followed by a subsequence of length 1 that contains I3. The first I3 itself will not be identified, because we are only interested in sequences that are followed by a subsequence of IX markers that contains I3.
How can I write a generic function/regex that accomplishes this task?
Recommended answer
Use strsplit, splitting on every run of IX markers that contains an I3:
> x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
> strsplit(x, "(?:-?I\\d+)*-?\\bI3-?(?:I\\d+-?)*")
[[1]]
[1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B" "C"
or
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I3-?)*")
[[1]]
[1] "A-B" "C"
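For reuse across many strings, the calls above could be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name `lead_up_sequences` is hypothetical, `perl = TRUE` is added here so that `(?:...)`, `\\b` and `\\d` are interpreted with PCRE semantics (the calls above do not pass it), and empty pieces are dropped so that a string starting with I3 does not yield a leading "".

```r
# Hypothetical helper; the name and the perl = TRUE choice are assumptions,
# not part of the original answer.
lead_up_sequences <- function(x) {
  # Delimiter: a run of IX markers (I followed by digits) that contains at
  # least one I3, together with the hyphens around it.
  pattern <- "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*"
  parts <- strsplit(x, pattern, perl = TRUE)[[1]]
  # strsplit() returns "" when the string starts with a delimiter
  # (e.g. "I3-A-I3"); drop such empty pieces.
  parts[nzchar(parts)]
}

lead_up_sequences("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
lead_up_sequences("A-B-I3-C-I3")
lead_up_sequences("I3-A-I3")
```

Under these assumptions, the three calls should return the pieces listed in the question: "A-B-C-I1-I2-D-E-F" and "D-D-D-D" for the first string, "A-B" and "C" for the second, and only "A" for the third.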