javascript - How to use a regexp to insert break tags after non-terminating dots in raw HTML text nodes

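I'm trying to create a regular expression that matches every dot that is not between quotes and is also not followed by a '<'.
This is for parsing text into SSML (Speech Synthesis Markup Language). The regex will be used to automatically add <break time="200ms"/> after a dot.

I've already created a pattern that matches all dots that are not between quotes: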

/\.(?=(?:[^"]|"[^"]*")*$)/g

The regex above gives the following output (^ = match):

This. is.a.<break time="0.5s"/> test sentence.
    ^   ^ ^                                  ^

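But the regex I want to create should not match the third dot, because it is immediately followed by a tag.
The matches should look like this: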

This. is.a.<break time="0.5s"/> test sentence.
    ^   ^                                    ^

Can anyone help me?

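Best answer

Capturing groups can help in this case.

You can consume, or even capture, quoted string expressions, as long as the dots are captured in a separate group: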

/((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g

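[^"\.] means any character except a dot or a double quote.

"(?:\\\\|\\"|[^"])*" means a string expression (which may contain escaped double quotes or dots).

So (?:[^"\.]|"(?:\\\\|\\"|[^"])*")* consumes every character other than a dot (.), skipping over dots inside string expressions wherever possible.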
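As a rough illustration, the pattern can be assembled from the pieces described above, which makes each part easier to read and to test on its own. The names `notDotOrQuote`, `quotedString`, `filler`, `pausableDot`, `composedRegexp`, and `wholeQuotedString` are made up for this sketch and are not part of the answer; `String.raw` is used only so that the backslashes can be written exactly as they appear in the regex literal.

```javascript
// Hypothetical names; each piece mirrors one part of the explanation above.
const notDotOrQuote = String.raw`[^"\.]`;              // anything but a dot or a double quote
const quotedString = String.raw`"(?:\\\\|\\"|[^"])*"`; // a double-quoted string, escapes allowed
const filler = `(?:${notDotOrQuote}|(?:${quotedString}))*`; // text the match may skip over
const pausableDot = String.raw`\.(?!\s*<)`;            // a dot not followed by optional whitespace and '<'

// Group 1: text before the dot, group 2: the dot, group 3: text after the dot.
const composedRegexp = new RegExp(`(${filler})(${pausableDot})(${filler})`, 'g');

// The quoted-string piece on its own accepts strings containing dots and escaped quotes.
const wholeQuotedString = new RegExp(`^${quotedString}$`);
console.log(wholeQuotedString.test('"0.5s"'));                // true
console.log(wholeQuotedString.test(String.raw`"Thi\\\"s."`)); // true
console.log(wholeQuotedString.test('no quotes'));             // false
```

Building the pattern from strings is only a readability aid here; the regex literal in the answer is an equally valid way to write it.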

When this regex is run against the following string:

"Thi\\\"s." is..a.<break time="0\".5s"/> test sentence.

it produces the following matches:

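Match 1

Full match, characters 0 to 15: "Thi\\\"s." is.
Group 2, characters 14 to 15: .

Match 2

Full match, characters 15 to 17: .a
Group 2, characters 15 to 16: .

Match 3

Full match, characters 18 to 55: <break time="0\".5s"/> test sentence.
Group 2, characters 54 to 55: .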

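You can test it with this wonderful tool.

Because of the way the expression is written, the captured dot is always in the second group, so the index of the dot is match.index + match[1].length (if match[1] exists).

Note: the given expression accounts for escaped double quotes; without that, the solution would fail as soon as it encountered one.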
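To make the note concrete, here is a small sketch of what happens without escape handling. It takes the quote-aware lookahead from the question, adds the same `(?!\s*<)` check, and runs it on both sample strings. `naiveRegexp`, `matchOffsets`, `plainSample`, and `escapedSample` are hypothetical names used only for this comparison.

```javascript
// The question's lookahead plus the "not before a tag" check.
// It pairs up raw double quotes and knows nothing about backslash escapes.
const naiveRegexp = /\.(?!\s*<)(?=(?:[^"]|"[^"]*")*$)/g;

// Hypothetical helper: offsets of every match of `re` in `text`.
const matchOffsets = (re, text) => Array.from(text.matchAll(re), (m) => m.index);

const plainSample = 'This. is.a.<break time="0.5s"/> test sentence.';
const escapedSample = String.raw`"Thi\\\"s." is..a.<break time="0\".5s"/> test sentence.`;

// No escaped quotes: the simple pattern is enough.
console.log(matchOffsets(naiveRegexp, plainSample)); // [4, 8, 45]

// Escaped quotes throw the quote pairing off: the dot inside "Thi\\\"s."
// (offset 9) is reported, while the dots at offsets 14 and 15 are missed.
console.log(matchOffsets(naiveRegexp, escapedSample)); // [9, 54]
```

If the input is known never to contain escaped quotes, the shorter pattern may be sufficient; the longer expression is what copes with the second sample.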

Here is the working solution, summed up:

// g is needed to collect all matches
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

function getMatchedPointsNotFollowedByChevronAndOutsideOfStrings(input) {
  let match;
  const result = [];

  // reset the regexp lastIndex because we're
  // re-using it at each call
  regexp.lastIndex = 0;

  while ((match = regexp.exec(input))) {
    // index of the dot is the match index +
    // the length of group 1 if present
    result.push(match.index + (match[1] ? match[1].length : 0));
  }

  // the result now contains the indices of all '.'
  // conforming to the rule we chose
  return result;
}

// Escaping an already-escaped string is tricky, so log it to check the actual contents
const testString = `"Thi\\\\\\"s." is..a.<break time="0\\".5s"/> test sentence.`;
console.log(testString);

// final result
console.log(
    getMatchedPointsNotFollowedByChevronAndOutsideOfStrings(testString)
);
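In engines that support the `d` (hasIndices) flag introduced in ES2022, the offset of group 2 can be read directly instead of being derived from the length of group 1. The following is a minimal sketch under that assumption; `regexpWithIndices`, `dotOffsetsFromIndices`, and `indicesSample` are hypothetical names, and the pattern is the same one as above with only the flag added.

```javascript
// Same pattern as above; the `d` flag makes every match expose an `indices` array.
const regexpWithIndices = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/dg;

function dotOffsetsFromIndices(input) {
  // matchAll iterates over a copy of the regexp, so the original's
  // lastIndex is never advanced and needs no manual reset.
  // indices[2] is the [start, end] pair of group 2, the captured dot.
  return Array.from(input.matchAll(regexpWithIndices), (m) => m.indices[2][0]);
}

const indicesSample = String.raw`"Thi\\\"s." is..a.<break time="0\".5s"/> test sentence.`;
console.log(dotOffsetsFromIndices(indicesSample)); // [14, 15, 54]
```

On older runtimes without the `d` flag, the `match.index + match[1].length` computation remains the portable option.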

Edit:

The OP actually wants to add the pause tags after the dots in the text, working on a raw HTML string.

Fully working solution:

// g is needed to collect all matches
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

function addPauses(input) {
    let match;
    const dotOffsets = [];

    // reset the regexp lastIndex because we're
    // re-using it at each call
    regexp.lastIndex = 0;

    // first collect all points offsets
    while ((match = regexp.exec(input))) {
        // offset of the dot is the match index + the length of first group if present
        dotOffsets.push(match.index + (match[1] ? match[1].length : 0));
    }

    // no points found, we can return the input as it is
    if (dotOffsets.length === 0) {
        return input;
    }

    // there are points, reconstruct the string with a break added after each
    const reduction = dotOffsets.reduce(
        (res, offset, index) => {
            // a segment is a substring of the input from a point
            // to the next (from 0 before the first point)
            const segment = input.substring(
              index <= 0 ? 0 : dotOffsets[index - 1] + 1,
              offset + 1
            );
            return `${res}${segment}<break time="200ms"/>`;
        },
        ''
    );

    // adding the last segment from the last point to the end of the string
    const rest = input.substring(dotOffsets[dotOffsets.length - 1] + 1);
    return `${reduction}${rest}`;
}

const testString = `
<p>
    This is a sample from Wikipedia.
    It is used as an example for this snippet.
</p>
<p>
    <b>Hypertext Markup Language</b> (<b>HTML</b>) is the standard
    <a href="/wiki/Markup_language.html" title="Markup language">
        markup language
    </a> for documents designed to be displayed in a
    <a href="/wiki/Web_browser.html" title="Web browser">
        web browser
    </a>.
    It can be assisted by technologies such as
    <a href="/wiki/Cascading_Style_Sheets" title="Cascading Style Sheets">
        Cascading Style Sheets
    </a>
    (CSS) and
    <a href="/wiki/Scripting_language.html" title="Scripting language">
        scripting languages
    </a>
    such as
    <a href="/wiki/JavaScript.html" title="JavaScript">JavaScript</a>.
</p>
`;


console.log(`Initial raw html:\n${testString}\n`);

console.log(`Result (added 2 pauses):\n${addPauses(testString)}\n`);
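A possible alternative, sketched here under stated assumptions rather than taken from the answer, is to let `String.prototype.replace` rebuild the string in a single pass. The pattern matches either a complete double-quoted string, which is returned unchanged, or a dot not followed by `<`, which is returned with the break tag appended. Because quoted strings are always consumed as whole tokens, a dot inside an attribute such as `href="x.y"` is never considered, even when no qualifying dot follows it; that is a case in which the offset-collecting pattern can restart its search inside the quotes and report the dot. `tokenRegexp`, `addPausesWithReplace`, and the `breakTag` parameter are hypothetical names.

```javascript
// Alternative 1: a complete double-quoted string, backslash escapes allowed.
// Alternative 2: a captured dot that is not followed by optional whitespace and '<'.
const tokenRegexp = /"(?:\\[\s\S]|[^"\\])*"|(\.)(?!\s*<)/g;

function addPausesWithReplace(input, breakTag = '<break time="200ms"/>') {
  return input.replace(tokenRegexp, (token, dot) =>
    // `dot` is undefined when the quoted-string alternative matched
    dot === undefined ? token : token + breakTag
  );
}

console.log(addPausesWithReplace('This. is.a.<break time="0.5s"/> test sentence.'));
// This.<break time="200ms"/> is.<break time="200ms"/>a.<break time="0.5s"/> test sentence.<break time="200ms"/>

console.log(addPausesWithReplace('<a href="x.y">t</a>'));
// <a href="x.y">t</a>
```

Like the answer's expression, this sketch treats every double quote as a string delimiter, so quotation marks used in ordinary prose would still need separate handling.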

Regarding "javascript - How to use a regexp to insert break tags after non-terminating dots in raw HTML text nodes", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57654770/