Replacing anchor href values with a regular expression

Problem description

I am trying to rewrite the href value of every anchor element in a web page by prepending my website's URL to it.

Before you suggest an XML/HTML parser: I have tried a bunch of them, and they do a great job in general, but for some of the websites I need to process, every one of them returns HTML that is simply messed up. That is probably due to the broken HTML those sites were written with in the first place, but since I have no control over that, regex is the only option here. So I came up with this code:

$response = '<h2><a href="http://www.google.com/test">Link</a></h2>';
$pattern = "/(<a .*?href=\"|')([^\"'#]+)(.*?<\/a>)/i";
$response = preg_replace_callback($pattern, 'html_href', $response);
function html_href($matches) {
    return $matches[1] . "http://example.com/" . $matches[2] . $matches[3];
}

This does change $response to:

<h2><a href="http://example.com/http://www.google.com/test">Link</a></h2>

That's great. But later I found out that this regex somehow also matches this:

$response = "var href = $(this).attr('rel'); $(this).replaceWith('<a href=\"' + decodeURL(href) + '\"><span>' + anchor+ '</span></a>');";
$pattern = "/(<a .*?href=\"|')([^\"'#]+)(.*?<\/a>)/i";
$response = preg_replace_callback($pattern, 'html_href', $response);
function html_href($matches) {
    return $matches[1] . "http://example.com/" . $matches[2] . $matches[3];
}

Here $response becomes:

var href = $(this).attr('http://example.com/rel'); $(this).replaceWith('<a href="' + decodeURL(href) + '"><span>' + anchor+ '</span></a>');

I don't really get how the argument inside the attr() call gets matched and replaced. Isn't this regex pattern supposed to match only parts of the string that start with <a ? I would like to avoid matching things inside JavaScript...

Recommended answer
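
The cause is the | in the first group. Alternation has the lowest precedence in a regex, so (<a .*?href=\"|') reads as "either <a .*?href=" or a lone '". The single quote in attr('rel') therefore satisfies group 1 by itself, group 2 captures rel, and group 3 runs on to the next </a>. Below is a minimal sketch of one way to tighten the pattern. It assumes a character class for the opening quote and [^>]*? to keep the match inside the opening tag. The function name prefix_hrefs is made up for illustration, and this is still a regex heuristic, not an HTML parser.

```php
<?php
// Hypothetical helper; the default prefix is the URL used in the question.
function prefix_hrefs(string $html, string $prefix = "http://example.com/"): string
{
    // Group 1: "<a", one whitespace character, any other attributes
    //          (never crossing ">"), then href= and its opening quote.
    //          ["'] keeps the quote choice inside the href branch, unlike
    //          href="|' where | splits the whole group in two.
    // Group 2: the URL, up to the closing quote or a "#".
    $pattern = '/(<a\s[^>]*?href=["\'])([^"\'#]+)/i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($prefix): string {
            return $m[1] . $prefix . $m[2];
        },
        $html
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $html;
}

$html = '<h2><a href="http://www.google.com/test">Link</a></h2>';
$js   = "var href = $(this).attr('rel'); $(this).replaceWith('<a href=\"' + decodeURL(href) + '\"><span>' + anchor+ '</span></a>');";

echo prefix_hrefs($html), "\n";
// <h2><a href="http://example.com/http://www.google.com/test">Link</a></h2>

echo prefix_hrefs($js), "\n";
// Printed unchanged: attr('rel') no longer matches, and the href=" in the
// script is followed directly by ', so group 2 has nothing to capture.
```

The trailing (.*?<\/a>) group from the original pattern only re-emitted text it had matched, so this sketch drops it. A script that contains a complete literal tag such as <a href="x"> would still be rewritten, so the pattern narrows the problem rather than eliminating it.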