Problem description
I am trying to rewrite the href value of every anchor element in a web page by prepending my website's URL to it.
Before you suggest an XML/HTML parser: I tried a bunch of them, and they do a great job, but for some of the websites I'm trying to parse, all of them return HTML that is simply messed up. That probably has to do with the HTML being broken in the first place, but since I have no control over that, regex is the only way here. So I came up with this code:
$response = '<h2><a href="http://www.google.com/test">Link</a></h2>';
$pattern = "/(<a .*?href=\"|')([^\"'#]+)(.*?<\/a>)/i";
$response = preg_replace_callback($pattern, 'html_href', $response);
function html_href($matches) {
return $matches[1] . "http://example.com/" . $matches[2] . $matches[3];
}
This does change $response to:
<h2><a href="http://example.com/http://www.google.com/test">Link</a></h2>
That's great. But later I found out that this regex somehow also matches this:
$response = "var href = $(this).attr('rel'); $(this).replaceWith('<a href=\"' + decodeURL(href) + '\"><span>' + anchor+ '</span></a>');";
$pattern = "/(<a .*?href=\"|')([^\"'#]+)(.*?<\/a>)/i";
$response = preg_replace_callback($pattern, 'html_href', $response);
function html_href($matches) {
return $matches[1] . "http://example.com/" . $matches[2] . $matches[3];
}
and here $response becomes:
var href = $(this).attr('http://example.com/rel'); $(this).replaceWith('<a href="' + decodeURL(href) + '"><span>' + anchor+ '</span></a>');
I don't really get how the string inside the attr() call ends up matched and replaced. Isn't this regex pattern supposed to match only parts of a string that start with <a ? I would like to avoid matching things inside JavaScript...
Recommended answer
Just a few general suggestions:
- Prefer <a\s+ instead of <a␣ (a literal space).
- Use [^<>]* thereafter instead of .*? for in-tag attribute skipping. (This is probably the main reason it spuriously matched JavaScript elsewhere.)
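- When you want to allow either " or ', use a character class [\"\'] just like you did in the middle. In the original pattern the | is not confined to the quote: it splits the whole first group into <a .*?href=\" or a lone ', so any ' can start a match, which is how 'rel' got rewritten.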
- Match the href= contents more strictly, for example with ([^<\"\'>]+)
- Then ensure another [\"\'] comes thereafter.
- And assert the end of the <a tag with [^<>]*> (that might be the other main culprit for not matching the desired tags/links).
- Use [^<>]+ again for the link text, if that consistently suits your input HTML.
Protip: write such regex patterns in the verbose /x notation whenever you can.
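The suggestions above can be combined into one pattern. What follows is only a sketch in PHP, not a definitive implementation. The function name prefix_hrefs is hypothetical, the prefix http://example.com/ is taken from the question, and # stays excluded from the URL as in the question's pattern. As a simplification, the pattern stops at the end of the opening <a tag instead of also matching the link text and </a>.

```php
<?php
// Hypothetical helper: prepend $prefix to the href of every <a ...> tag.
function prefix_hrefs(string $html, string $prefix): string
{
    // The /x flag lets the pattern span several lines with comments.
    // "~" is used as the delimiter so that "/" needs no escaping.
    $pattern = '~
        ( <a\s+ [^<>]*? \bhref \s* = \s* ["\'] )  # 1: tag start up to the opening quote
        ( [^<>"\'\#]+ )                           # 2: the URL, at least one character
        ( ["\'] [^<>]* > )                        # 3: closing quote and rest of the tag
    ~ix';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($prefix): string {
            return $m[1] . $prefix . $m[2] . $m[3];
        },
        $html
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $html;
}

// The two inputs from the question.
$html = '<h2><a href="http://www.google.com/test">Link</a></h2>';
$js = <<<'JS'
var href = $(this).attr('rel'); $(this).replaceWith('<a href="' + decodeURL(href) + '"><span>' + anchor+ '</span></a>');
JS;

echo prefix_hrefs($html, 'http://example.com/'), "\n";
echo prefix_hrefs($js, 'http://example.com/'), "\n";
```

With these two inputs, the HTML link gets the prefix while the JavaScript string should come back unchanged: a lone ' can no longer start a match, and after href=" the next character is a quote, so group 2 finds no URL character to capture. Broken markup can still defeat this (for example a > inside an attribute value), so treat it as a starting point.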