How to get all captures of subgroup matches with preg_match_all()?

Question

Update/Note:

See also: PCRE regular expressions using named pattern subroutines.

(Please read carefully:)

I have a string that contains a variable number of segments (simplified):

$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well

I would now like to match the segments and return them via the matches array:

$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);

This only returns the last match for capture group 2: DD.
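The behaviour described above can be reproduced with a short script. This is only a sketch built from the $subject and $pattern in the question; the comments describe what the default PREG_PATTERN_ORDER layout of $matches looks like for this input.

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/^(([a-z]+) )+$/i';

// The anchored pattern matches the whole subject exactly once.
$result = preg_match_all($pattern, $subject, $matches);

var_dump($result);
print_r($matches);
// $matches[0] holds the full match 'AA BB DD '.
// $matches[1] holds 'DD ' and $matches[2] holds 'DD':
// each repeated group keeps only the text of its final iteration.
```

The earlier iterations (AA and BB) are matched but overwritten, which is why they never show up in $matches.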

Is there a way to retrieve all subpattern captures (AA, BB, DD) with a single regex execution? Isn't preg_match_all suitable for this?

Both $subject and $pattern are simplified. Naturally, a regular list such as AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of $pattern.

But I'm specifically asking how to return all of the subgroup matches with the preg_... family of functions.

For a real-life case, imagine multiple (nested) levels with a variable number of subpattern matches.

This is an example in pseudocode to describe a bit of the background. Imagine the following:

Regular definitions of tokens:

   CHARS := [a-z]+
   PUNCT := [.,!?]
   WS := [ ]

$subject gets tokenized based on these definitions. The tokenization is stored in an array of tokens (type, offset, ...).

That array is then transformed into a string containing one character per token:

   CHARS -> "c"
   PUNCT -> "p"
   WS -> "s"
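The tokenization and the one-character-per-token mapping could look roughly like the sketch below. The function name tokenize, the array layout (type, offset, text) and the sample input are assumptions made for illustration; the question does not show the actual implementation.

```php
<?php
// Token definitions from the text, keyed by their one-character stream symbol.
$definitions = [
    'c' => '[a-z]+',   // CHARS
    'p' => '[.,!?]',   // PUNCT
    's' => '[ ]',      // WS
];

// Hypothetical tokenizer: returns a list of tokens with type, offset and text.
function tokenize(string $subject, array $definitions): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($subject);
    while ($offset < $length) {
        foreach ($definitions as $symbol => $regex) {
            // \G anchors the attempt at $offset, so no input is skipped.
            if (preg_match('/\G(?:' . $regex . ')/', $subject, $m, 0, $offset)) {
                $tokens[] = ['type' => $symbol, 'offset' => $offset, 'text' => $m[0]];
                $offset += strlen($m[0]);
                continue 2; // move on to the next token
            }
        }
        throw new InvalidArgumentException("No token matches at offset $offset");
    }
    return $tokens;
}

$tokens = tokenize('hello world!', $definitions);

// One character per token gives the token stream string.
$stream = implode('', array_column($tokens, 'type'));
echo $stream, PHP_EOL; // cscp
```

Because every token contributes exactly one character, a position in the stream string is also the index of the token in the array.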

This makes it possible to run regular expressions based on tokens (rather than character classes etc.) on the token stream string. E.g.

   regex: (cs)?cp

which expresses one or two groups of chars, separated by whitespace and followed by punctuation (one or more groups would be (cs)*cp).
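As a sketch of how a match on the token stream can be mapped back to the original text: the token array below is hypothetical (hand-written for the input "hello world!", with an assumed offset/length layout), and the pattern is the one from the text.

```php
<?php
$subject = 'hello world!';

// Hypothetical token array, as a tokenizer might produce it for $subject.
$tokens = [
    ['type' => 'c', 'offset' => 0,  'length' => 5],
    ['type' => 's', 'offset' => 5,  'length' => 1],
    ['type' => 'c', 'offset' => 6,  'length' => 5],
    ['type' => 'p', 'offset' => 11, 'length' => 1],
];
$stream = implode('', array_column($tokens, 'type')); // "cscp"

$text = null;
if (preg_match('/(cs)?cp/', $stream, $m, PREG_OFFSET_CAPTURE)) {
    // With PREG_OFFSET_CAPTURE each entry is [matched string, offset].
    [$matchedTokens, $firstIndex] = $m[0];
    $first = $tokens[$firstIndex];
    $last  = $tokens[$firstIndex + strlen($matchedTokens) - 1];

    // Translate token indexes back into a byte range of the subject.
    $text = substr(
        $subject,
        $first['offset'],
        $last['offset'] + $last['length'] - $first['offset']
    );
    echo $text, PHP_EOL;
}
```

The stream offset doubles as a token index only because each token is encoded as a single character.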

Since I can now express self-defined tokens as a regex, the next step was to build the grammar. This is only an example, in a sort of ABNF style:

   words = word | (word space)+ word
   word = CHARS+
   space = WS
   punctuation = PUNCT

If I now compile the grammar for words into a (token) regex, I naturally want all subgroup matches of each word.

  words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+)    # words resolved to tokens
  words = (c+)|((c+)s)+(c+)                       # words resolved to regex
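One way such a compilation step might work is plain recursive substitution of rule names. The sketch below is an assumption, not the author's compiler: the resolve function is hypothetical, and the rules are written with explicit groups around each word and without spaces so the result can be used directly as a regex. The final lines show the problem the question is about.

```php
<?php
// Terminals map token names to their one-character stream symbols.
$terminals = ['CHARS' => 'c', 'PUNCT' => 'p', 'WS' => 's'];

// Grammar rules, with an explicit capture group around each word.
$rules = [
    'words' => '(word)|((word)space)+(word)',
    'word'  => 'CHARS+',
    'space' => 'WS',
];

// Hypothetical resolver: replaces every name by its definition, recursively.
function resolve(string $expr, array $rules, array $terminals): string
{
    return preg_replace_callback(
        '/[A-Za-z]+/',
        function (array $m) use ($rules, $terminals): string {
            $name = $m[0];
            if (isset($terminals[$name])) {
                return $terminals[$name];
            }
            if (isset($rules[$name])) {
                return resolve($rules[$name], $rules, $terminals);
            }
            throw new InvalidArgumentException("Unknown symbol: $name");
        },
        $expr
    );
}

$regex = resolve($rules['words'], $rules, $terminals);
echo $regex, PHP_EOL; // (c+)|((c+)s)+(c+)

// Token stream for three words separated by whitespace.
preg_match('/^(?:' . $regex . ')$/', 'cscsc', $m, PREG_OFFSET_CAPTURE);
print_r($m[3]); // repeated (c+) group: offset 2, i.e. only the second word
```

The repeated group reports stream offset 2, so the first word at offset 0 has been overwritten by the later iteration.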

I could code up to this point. Then I ran into the problem that the subgroup matches only contained their last match.

So I can either build an automaton for the grammar myself (which I would like to avoid, to keep the grammar expressions generic) or somehow make preg_match work for me so I can spare myself that.

That's basically all. It is probably understandable now why I simplified the question.

Related:

  • pcrepattern man page
  • Get repeated matches with preg_match_all()

Recommended answer

Similar thread:

Check the chosen answer there; mine might also be useful, so I will duplicate it here:

From http://www.php.net/manual/zh/regexp.reference.repetition.php :

Personally, I gave up and am going to do this in two steps.
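A two-step approach for the simplified example from the question might look like the sketch below. The split into a validation pattern and an extraction pattern is an assumption about what "two steps" means here; the answer does not spell it out.

```php
<?php
$subject = 'AA BB DD ';

// Step 1: validate the overall structure with the anchored pattern.
// A non-capturing group is enough, since the captures are not used here.
$isValid = preg_match('/^(?:[a-z]+ )+$/i', $subject) === 1;

// Step 2: extract every segment with an unanchored pattern.
$segments = [];
if ($isValid) {
    preg_match_all('/([a-z]+) /i', $subject, $matches);
    $segments = $matches[1];
}

print_r($segments); // AA, BB, DD
```

For nested grammars the second step would have to be repeated per nesting level, which is the cost of this approach.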

Edit:

I see that in the other thread someone claimed the lookbehind method can do this.
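The lookbehind variant from the other thread is not reproduced here. As a related single-pattern sketch for the simplified example, the \G anchor combined with a lookahead can validate the whole subject on the first match and then return one segment per match. This is a swapped-in technique and an assumption about what would help, not the method that thread describes.

```php
<?php
$subject = 'AA BB DD ';

// First alternative: at the start, a lookahead requires the whole subject
// to have the expected shape. Second alternative: \G continues exactly
// where the previous match ended; (?!^) keeps it from matching at offset 0.
$pattern = '/(?:^(?=(?:[a-z]+ )+$)|\G(?!^))([a-z]+) /i';

$count = preg_match_all($pattern, $subject, $matches);
print_r($matches[1]); // AA, BB, DD

// A malformed subject fails the lookahead, so nothing is captured.
$invalidCount = preg_match_all($pattern, 'AA 1 BB ', $invalid);
var_dump($invalidCount); // int(0)
```

This turns one repeated group into several consecutive matches, so it helps with flat lists but does not by itself solve the nested case from the question.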

