This article covers how to handle a stack overflow caused by a Java regular expression that matches opening and closing tags.

Problem description

The standard implementation of the Java Pattern class uses recursion to implement many forms of regular expressions (e.g., certain operators, alternation).

This approach causes stack overflow issues with input strings that exceed a (relatively small) length, which may not even be more than 1,000 characters, depending on the regex involved.

A typical example of this is the following regex using alternation to extract a possibly multiline element (named Data) from a surrounding XML string, which has already been supplied:

<Data>(?<data>(?:.|\r|\n)+?)</Data>

The above regex is used with the Matcher.find() method to read the "data" capturing group. It works as expected until the supplied input string grows beyond roughly 1,200 characters, at which point it causes a stack overflow.
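The usage described above can be sketched as follows. This is a minimal reproduction attempt, not the asker's actual code: the class name `AlternationOverflowDemo`, the helper `extract`, and the 200,000-character test input are assumptions made for illustration. Whether the large input really overflows depends on the JVM version and the thread stack size, so the sketch catches `StackOverflowError` and reports either outcome.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationOverflowDemo {
    // The pattern from the question: every character goes through an alternation.
    private static final Pattern DATA =
            Pattern.compile("<Data>(?<data>(?:.|\\r|\\n)+?)</Data>");

    /** Returns the "data" group of the first match, or null if nothing matches. */
    static String extract(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // A short input is matched without trouble.
        System.out.println(extract("<Data>short\ncontent</Data>"));

        // Build a long multiline element (hypothetical size, chosen to be large).
        StringBuilder big = new StringBuilder("<Data>");
        for (int i = 0; i < 200_000; i++) {
            big.append(i % 80 == 79 ? '\n' : 'x');
        }
        big.append("</Data>");

        try {
            String data = extract(big.toString());
            System.out.println("matched " + data.length() + " characters");
        } catch (StackOverflowError e) {
            // Outcome depends on JVM version and stack size (-Xss).
            System.out.println("StackOverflowError on " + big.length() + " characters");
        }
    }
}
```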

Can the above regex be rewritten to avoid the stack overflow issue?

Recommended answer

On the origin of the stack overflow issue:

Your regex, which relies on alternation, matches one or more arbitrary characters between the two tags.

You can use a lazy dot matching pattern with the Pattern.DOTALL modifier (or the equivalent embedded flag (?s)), which makes . match line break characters as well:

(?s)<Data>(?<data>.+?)</Data>

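As a rough illustration of the DOTALL variant above, the sketch below passes `Pattern.DOTALL` as a compile flag instead of embedding `(?s)`; the two forms are equivalent. The class name `DotAllDataExtractor` and the helper `extract` are hypothetical names introduced only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllDataExtractor {
    // Pattern.DOTALL makes '.' match line terminators too, so no alternation is needed.
    private static final Pattern DATA =
            Pattern.compile("<Data>(?<data>.+?)</Data>", Pattern.DOTALL);

    /** Returns the content of the first Data element, or null if none is found. */
    static String extract(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        String xml = "<Root>\n<Data>line one\nline two</Data>\n</Root>";
        System.out.println(extract(xml));
    }
}
```

Because `.+?` requires at least one character, an empty `<Data></Data>` element is not matched by this pattern.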

However, lazy dot matching patterns can still consume a lot of memory on very large inputs. A better option is the unroll-the-loop technique:

<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>


Details:

  • <Data> - literal text <Data>
  • (?<data> - start of the capturing group "data"
    • [^<]* - zero or more characters other than <
    • (?:<(?!/?Data>)[^<]*)* - zero or more sequences of:
      • <(?!/?Data>) - a < that is not followed by Data> or /Data>
      • [^<]* - zero or more characters other than <
  • ) - end of the capturing group "data"
  • </Data> - literal text </Data>
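The unrolled pattern can be tried out with a sketch like the one below. The class name `UnrolledDataExtractor`, the helper `extract`, and the sample inputs are assumptions for illustration. The large input deliberately contains only one inner `<`, since the outer group repeats once per `<` found inside the element.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnrolledDataExtractor {
    // Unrolled loop: consume runs of non-'<' characters, and accept a '<'
    // only when it does not start <Data> or </Data>.
    private static final Pattern DATA =
            Pattern.compile("<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>");

    /** Returns the content of the first Data element, or null if none is found. */
    static String extract(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // Nested markup and line breaks inside the element are kept as-is.
        System.out.println(extract("<Root><Data>a <b>bold</b>\nline</Data></Root>"));

        // A long multiline element with a single inner '<'.
        StringBuilder big = new StringBuilder("<Data>1 < 2\n");
        for (int i = 0; i < 200_000; i++) {
            big.append(i % 80 == 79 ? '\n' : 'x');
        }
        big.append("</Data>");
        System.out.println(extract(big.toString()).length());
    }
}
```

Unlike the `.+?` version, this pattern also matches an empty `<Data></Data>` element, because `[^<]*` can match zero characters.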

