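I am trying to create a regular expression that matches every dot that is not between quotes and is also not followed by '<'.
This is for parsing text into SSML (Speech Synthesis Markup Language). The regex will be used to automatically add <break time="200ms"/> after a dot.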
I have already created a pattern that matches all dots that are not between quotes:
/\.(?=(?:[^"]|"[^"]*")*$)/g
The regex above gives the following output (^ = match):
This. is.a.<break time="0.5s"/> test sentence.
    ^   ^ ^                                  ^
However, the regex I want should not match the third dot.
The matches should look like this:
This. is.a.<break time="0.5s"/> test sentence.
    ^   ^                                    ^
Can anyone help me?
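To make the starting point concrete, here is a minimal sketch that runs the pattern from the question on the sample sentence and lists the offsets of the matched dots. The helper name `dotOffsets` is invented for this illustration and is not part of the original post.

```javascript
// Pattern from the question: a dot that is followed, up to the end of the
// input, only by non-quote characters or complete "..." pairs.
const outsideQuotes = /\.(?=(?:[^"]|"[^"]*")*$)/g;

// Hypothetical helper: offsets of every match of `re` in `text`.
// matchAll works on a copy of the regex, so lastIndex is left untouched.
function dotOffsets(re, text) {
  return Array.from(text.matchAll(re), (m) => m.index);
}

const sample = 'This. is.a.<break time="0.5s"/> test sentence.';
console.log(JSON.stringify(dotOffsets(outsideQuotes, sample))); // [4,8,10,45]
```

Offset 10 is the dot directly in front of the existing <break .../> tag, which is the match that should be excluded.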
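Best answer
Capture groups can help in this case.
You can consume, or even capture, the string expressions, as long as the dot is captured in a separate group: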
/((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g
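[^"\.]
means any character except a dot or a double quote.
"(?:\\\\|\\"|[^"])*"
means a string expression (which may contain escaped double quotes, or dots). Therefore
(?:[^"\.]|"(?:\\\\|\\"|[^"])*")*
consumes every character other than a dot (.), skipping the dots inside string expressions wherever possible. Executing this regex on the following string: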
"Thi\\\"s." is..a.<break time="0\".5s"/> test sentence.
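produces the following matches:
Match 1
Full match, characters 0 to 15:
"Thi\\\"s." is.
Group 2, characters 14 to 15:
.
Match 2
Full match, characters 15 to 17:
.a
Group 2, characters 15 to 16:
.
Match 3
Full match, characters 18 to 55:
<break time="0\".5s"/> test sentence.
Group 2, characters 54 to 55:
.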
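The role of the string sub-pattern is easiest to see in isolation. The sketch below compares it with a simpler quoted-string pattern on the first part of the test string. The names `escapedAware` and `naive` are made up for this illustration, and the simpler pattern is an assumption added for contrast, not something taken from the answer.

```javascript
// Quoted-string sub-pattern from the answer: inside the quotes it accepts
// an escaped backslash (\\), an escaped quote (\") or any non-quote character.
const escapedAware = /"(?:\\\\|\\"|[^"])*"/;

// Simpler pattern for comparison (assumption, not from the answer):
// it ends at the first double quote, whether that quote is escaped or not.
const naive = /"[^"]*"/;

// String.raw keeps every backslash literal, so this is the start of the
// test string exactly as it appears in the text.
const text = String.raw`"Thi\\\"s." is..a.`;

const full = text.match(escapedAware)[0];
const cut = text.match(naive)[0];

console.log(full); // "Thi\\\"s."
console.log(cut);  // "Thi\\\"
```

With the simpler pattern the string would be considered closed too early, and the dot in `s.` would be treated as lying outside a string.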
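You can test it with this wonderful tool.
Because of the way the expression is written, the captured dot is always in the second group, so the index of the dot is match.index + group[1].length (if group[1] exists). Note: the given expression accounts for escaped double quotes; without that, the solution would fail on input that contains them.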
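As a quick check of that index rule, the sketch below computes the dot offsets in two ways: with match.index plus the length of group 1, and, as a cross-check, with the `d` (hasIndices) flag. The second variant assumes an ES2022 runtime and is not part of the original answer. The variable names are illustrative.

```javascript
// The expression from the answer, copied unchanged.
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

// String.raw keeps the backslashes literal: this is the test string as shown.
const text = String.raw`"Thi\\\"s." is..a.<break time="0\".5s"/> test sentence.`;

// Rule from the text: offset of the dot = match.index + length of group 1.
const byLength = Array.from(
  text.matchAll(regexp),
  (m) => m.index + m[1].length
);

// Cross-check (assumes ES2022): with the `d` flag every match carries an
// `indices` array, and indices[2][0] is the start offset of group 2.
const withIndices = new RegExp(regexp.source, 'dg');
const byIndices = Array.from(
  text.matchAll(withIndices),
  (m) => m.indices[2][0]
);

console.log(JSON.stringify(byLength));  // [14,15,54]
console.log(JSON.stringify(byIndices)); // [14,15,54]
```

Group 1 can match the empty string but always participates in a match, so its length is simply 0 when the dot is the first character of the match.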
Here is the working solution, summed up:
// g is needed to collect all matches
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;
function getMatchedPointsNotFollowedByChevronAndOutsideOfStrings(input) {
  let match;
  const result = [];
  // reset the regexp lastIndex because we're
  // re-using it at each call
  regexp.lastIndex = 0;
  while ((match = regexp.exec(input))) {
    // index of the dot is the match index +
    // the length of group 1 if present
    result.push(match.index + (match[1] ? match[1].length : 0));
  }
  // the result now contains the indices of all '.'
  // conforming to the rule we chose
  return result;
}
// Escaping an already escaped string is tricky;
// the console.log below shows the actual value
const testString = `"Thi\\\\\\"s." is..a.<break time="0\\".5s"/> test sentence.`;
console.log(testString);
// final result
console.log(
  getMatchedPointsNotFollowedByChevronAndOutsideOfStrings(testString)
);
Edit:
The OP actually wants to add the pause tags after the dots in the text, working on a raw HTML string.
Fully working solution:
// g is needed to collect all matches
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;
function addPauses(input) {
  let match;
  const dotOffsets = [];
  // reset the regexp lastIndex because we're
  // re-using it at each call
  regexp.lastIndex = 0;
  // first collect all points offsets
  while ((match = regexp.exec(input))) {
    // offset of the dot is the match index + the length of first group if present
    dotOffsets.push(match.index + (match[1] ? match[1].length : 0));
  }
  // no points found, we can return the input as it is
  if (dotOffsets.length === 0) {
    return input;
  }
  // there are points, reconstruct the string with a break added after each
  const reduction = dotOffsets.reduce(
    (res, offset, index) => {
      // a segment is a substring of the input from a point
      // to the next (from 0 before the first point)
      const segment = input.substring(
        index <= 0 ? 0 : dotOffsets[index - 1] + 1,
        offset + 1
      );
      return `${res}${segment}<break time="200ms"/>`;
    },
    ''
  );
  // adding the last segment from the last point to the end of the string
  const rest = input.substring(dotOffsets[dotOffsets.length - 1] + 1);
  return `${reduction}${rest}`;
}
const testString = `
<p>
This is a sample from Wikipedia.
It is used as an example for this snippet.
</p>
<p>
<b>Hypertext Markup Language</b> (<b>HTML</b>) is the standard
<a href="/wiki/Markup_language.html" title="Markup language">
markup language
</a> for documents designed to be displayed in a
<a href="/wiki/Web_browser.html" title="Web browser">
web browser
</a>.
It can be assisted by technologies such as
<a href="/wiki/Cascading_Style_Sheets" title="Cascading Style Sheets">
Cascading Style Sheets
</a>
(CSS) and
<a href="/wiki/Scripting_language.html" title="Scripting language">
scripting languages
</a>
such as
<a href="/wiki/JavaScript.html" title="JavaScript">JavaScript</a>.
</p>
`;
console.log(`Initial raw html:\n${testString}\n`);
console.log(`Result (added 2 pauses):\n${addPauses(testString)}\n`);
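Since the expression captures the text before the dot, the dot itself and the text after it in three groups, the same result can probably be obtained with a single call to String.prototype.replace. The following is only a sketch of that alternative, not the answer's own method; the function name `addPausesWithReplace` is hypothetical.

```javascript
// The expression from the answer, copied unchanged.
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

// Hypothetical alternative to addPauses: re-emit the three groups and put
// the break tag right after the dot (group 2). Text that is not part of any
// match, such as a dot directly followed by a tag, is copied as it is.
function addPausesWithReplace(input) {
  // replace() with a global regex starts from lastIndex 0 by itself
  return input.replace(regexp, '$1$2<break time="200ms"/>$3');
}

const sample = 'This. is.a.<break time="0.5s"/> test sentence.';
console.log(addPausesWithReplace(sample));
// This.<break time="200ms"/> is.<break time="200ms"/>a.<break time="0.5s"/> test sentence.<break time="200ms"/>
```

The dot in front of the existing <break time="0.5s"/> and the dot inside "0.5s" are left alone, which matches the behaviour asked for in the question.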
Regarding "javascript - How to insert a break tag after non-terminating points in a raw html text node using regexp", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57654770/