Problem description
I'm trying to write a bit of regex that goes through some text written by our editors and applies an <acronym> tag to the first instance it finds of each abbreviation in a set we hold in our "Glossary of Terms".
For this example I've used the abbreviation ITS.
The first thing I did was set up an example with a mix of scenarios I could test against, i.e. ITS sitting next to punctuation, ITS inside HTML tags, and ones that we've already applied this to (in other words, the script has run through this text before, so there's no need to do it again).
I'm almost there but got stuck at the last point :-(.
Here's the regex I've got so far: <[^<|]+?>?>ITS<[^<]+?>|ITS
Example input (ITS in bold):
This is another test as I still want to update <p>ITS</p> that have other HTML tags wrapped around them.
ITS want ones that start sentences and ones that finish ITS. ITS, and ones which are wrapped in punctuation.
Test link:
<a href="index.cfm">ITS</a>
And I want this changed to:
This is another test as I still want to update <acronym title="ITS">ITS</acronym> that have other HTML tags wrapped around them.
<acronym title="ITS">ITS</acronym> want ones that start sentences and ones that finish <acronym title="ITS">ITS</acronym>. <acronym title="ITS">ITS</acronym>, and ones which are wrapped in punctuation.
Test link:
<acronym title="ITS"><a href="index.cfm">ITS</a></acronym>
Are there any regex experts out there who could help me finish this off? Any other hints or tips would also be appreciated.
** UPDATE ** I don't know if this helps, but this would find only the ITS that is already wrapped in an <acronym> tag in that paragraph:
<acronym[^<]*ITS</acronym>
And this will find all the ITS occurrences:
<[^>]*ITS<[^<]*|ITS
What I really need is a way of combining these to say find all the ITSs but exclude those in tags.
Thanks a lot,
James
P.S. This is going to be placed in a ColdFusion application if that helps anyone in specific syntax.
Recommended answer
Here is your basic problem: regex is not a parser. This problem has been approached many times, and there is no general-purpose solution using only regex. You can fake it to a point by using lookahead, lookbehind, and some really complicated footwork, but you quickly get to the point where your expression is way too complicated to maintain.
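To illustrate what "faking it to a point" looks like, here is a minimal Python sketch (Python rather than ColdFusion, purely for illustration). The pattern is my own hypothetical attempt, not something from the question: it uses two negative lookaheads to skip ITS inside a tag or inside an existing <acronym> element, and it already misfires on a literal ">" in ordinary text.

```python
import re

# Hypothetical pattern (an assumption for illustration only).
# Match ITS unless:
#   (1) the next angle bracket is ">", i.e. we appear to be inside a tag, or
#   (2) a closing </acronym> follows before any new <acronym opens.
FAKED = re.compile(
    r"\bITS\b"
    r"(?![^<>]*>)"
    r"(?!(?:(?!<acronym\b).)*?</acronym>)",
    re.DOTALL,
)

sample = 'ITS <a href="ITS.cfm">ITS</a> <acronym title="ITS">ITS</acronym>'

# Offsets of the matches: the bare ITS and the one inside <a>...</a>.
hits = [m.start() for m in FAKED.finditer(sample)]

# A literal ">" in plain text is enough to fool lookahead (1),
# so this ITS is wrongly skipped.
missed = FAKED.findall("ITS > 5")
```

Every new edge case (comments, nested elements, stray angle brackets) tends to need another lookaround, which is where maintainability goes downhill.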
I can suggest a couple of approaches.
If you are using text that is XML compliant, you can parse the text using xmlparse() and then step through the resulting structure, applying your regex to the xmltext of each node.
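As a rough sketch of the parse-and-walk idea, here is how it might look in Python with xml.etree.ElementTree standing in for xmlparse(). The function names (wrap_text, tag_tree) are made up for this example, it assumes the input is well-formed XML with a single root element, and it wraps every occurrence rather than only the first. Note that it places the <acronym> inside an enclosing <a> rather than around it, so it would need adjusting to match the exact output asked for in the question.

```python
import re
import xml.etree.ElementTree as ET

ABBR = re.compile(r"\bITS\b")

def wrap_text(text):
    """Split text on ITS; return (leading text, new <acronym> elements)."""
    if not text:
        return text, []
    parts = ABBR.split(text)
    new_elements = []
    for following in parts[1:]:
        el = ET.Element("acronym", {"title": "ITS"})
        el.text = "ITS"
        el.tail = following  # text that came after this ITS
        new_elements.append(el)
    return parts[0], new_elements

def tag_tree(parent):
    """Walk the tree, skipping anything already inside <acronym>."""
    if parent.tag == "acronym":
        return
    children = list(parent)
    for child in children:
        tag_tree(child)
    # Text directly inside parent, before its first child.
    parent.text, new = wrap_text(parent.text)
    for i, el in enumerate(new):
        parent.insert(i, el)
    # Text that follows each original child.
    for child in children:
        child.tail, new = wrap_text(child.tail)
        idx = list(parent).index(child)
        for i, el in enumerate(new, start=1):
            parent.insert(idx + i, el)

src = ('<div>ITS starts. <p>ITS</p> and '
       '<acronym title="ITS">ITS</acronym> done, ITS.</div>')
root = ET.fromstring(src)
tag_tree(root)
out = ET.tostring(root, encoding="unicode")
```

The already-tagged ITS is left alone because the walk returns as soon as it reaches an <acronym> element, which is the part that is hard to express in a single regex.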
Alternately, you can try replacing each tag in the text block with a placeholder, doing a replace on the resulting text, then restoring the placeholders.
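The placeholder idea could be sketched like this, again in Python for illustration only. The function name tag_abbreviation and the NUL-delimited placeholder format are assumptions of this example; it also assumes the text contains no NUL characters and, as above, wraps every occurrence rather than only the first.

```python
import re

ABBR = re.compile(r"\bITS\b")

# Hide whole, already-tagged <acronym>...</acronym> elements first,
# then any other individual tag (including its attributes).
PROTECTED = re.compile(
    r"<acronym\b[^>]*>.*?</acronym>|<[^>]+>",
    re.DOTALL | re.IGNORECASE,
)

def tag_abbreviation(html):
    stash = []

    def hide(match):
        # Remember the original markup and leave a numbered placeholder.
        stash.append(match.group(0))
        return "\x00%d\x00" % (len(stash) - 1)

    masked = PROTECTED.sub(hide, html)
    # Only plain text is left exposed, so a simple replace is safe here.
    tagged = ABBR.sub('<acronym title="ITS">ITS</acronym>', masked)
    # Put the original markup back in place of each placeholder.
    return re.sub(r"\x00(\d+)\x00", lambda m: stash[int(m.group(1))], tagged)

result = tag_abbreviation(
    '<p>ITS</p> ITS, <acronym title="ITS">ITS</acronym> '
    '<a href="ITS.cfm">ITS</a>'
)
```

Because the href value and the existing <acronym> element are stashed before the replace runs, neither is touched, and running the function a second time over its own output leaves it unchanged.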
Obviously, neither of these is perfect, but either, with some tweaking, may get you where you're going.