JavaScript regular expression to extract anchor text and URL from anchor tags

Question

I have a paragraph of text in a JavaScript variable called 'input_content', and that text contains multiple anchor tags/links. I would like to match all of the anchor tags, extract the anchor text and URL from each, and put them into an array like (or similar to) this:

Array
(
    [0] => Array
        (
            [0] => <a href="http://yahoo.com">Yahoo</a>
            [1] => http://yahoo.com
            [2] => Yahoo
        )
    [1] => Array
        (
            [0] => <a href="http://google.com">Google</a>
            [1] => http://google.com
            [2] => Google
        )
)

I've taken a crack at it (http://pastie.org/339755), but I am stumped beyond this point. Thanks for the help!

Recommended answer

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

This assumes that your anchors will always be in the form <a href="...">...</a> i.e. it won't work if there are any other attributes (for example, target). The regular expression can be improved to accommodate this.
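
One possible way to relax that restriction is sketched below. The helper name extractAnchors and the pattern are illustrative assumptions, not part of the original answer. The pattern allows other attributes before and after href, but it still assumes double-quoted attribute values and anchor text with no nested tags:

```javascript
// Hypothetical helper: collects [whole anchor, href, text] triples.
// Assumes double-quoted attributes and plain-text anchor content.
function extractAnchors(html) {
    // (?:[^>]*?\s)? optionally skips attributes that come before href;
    // [^>]* skips any attributes that come after it.
    var anchorPattern = /<a\s+(?:[^>]*?\s)?href="([^"]+)"[^>]*>([^<]+)<\/a>/g;
    var results = [];
    html.replace(anchorPattern, function (whole, href, text) {
        results.push([whole, href, text]);
    });
    return results;
}

var sample = 'see <a target="_blank" href="http://yahoo.com" class="x">Yahoo</a>' +
    ' and <a href="http://google.com">Google</a>';
var found = extractAnchors(sample);
console.log(found);
```

For markup that is less regular than this (single quotes, nested elements, unusual spacing), a real HTML parser is likely to be more reliable than any regular expression.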

To break down the regular expression:


/ -> start regular expression
  [^<]* -> skip all characters until the first <
  ( -> start capturing first token
    <a href=" -> capture first bit of anchor
    ( -> start capturing second token
        [^"]+ -> capture all characters until a "
    ) -> end capturing second token
    "> -> capture more of the anchor
    ( -> start capturing third token
        [^<]+ -> capture all characters until a <
    ) -> end capturing third token
    <\/a> -> capture last bit of anchor
  ) -> end capturing first token
/g -> end regular expression, add global flag to match all anchors in string

Each call to our anonymous function will receive three tokens as the second, third and fourth arguments, namely arguments[1], arguments[2], arguments[3]:


  • arguments[1] is the whole anchor

  • arguments[2] is the href part

  • arguments[3] is the text inside

We'll use a hack to push these three arguments as a new array into our main matches array. The built-in arguments variable is array-like but not a true JavaScript Array, so we have to borrow the slice Array method and call it on arguments to extract the items we want:

Array.prototype.slice.call(arguments, 1, 4)

This will extract items from arguments starting at index 1 and ending (not inclusive) at index 4.
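
As a small illustration of that call, the sketch below compares it with ES2015 rest parameters, which give a real Array directly. The function names viaSlice and viaRest and the sample argument values are made up for this example; the values only mimic the order in which replace passes arguments to its callback (full match, capture groups, offset, input string):

```javascript
// Borrow Array.prototype.slice because arguments is only array-like.
function viaSlice() {
    return Array.prototype.slice.call(arguments, 1, 4);
}

// Rest parameters (ES2015) collect the arguments into a true Array.
function viaRest(...args) {
    return args.slice(1, 4);
}

// Placeholder values standing in for what replace() would pass.
var sliced = viaSlice("full match", "<a ...>...</a>", "url", "text", 0, "input");
var rested = viaRest("full match", "<a ...>...</a>", "url", "text", 0, "input");

console.log(sliced); // items at indexes 1, 2 and 3
console.log(rested); // same three items
```

Both calls keep indexes 1 through 3 and drop the full match at index 0 as well as the trailing offset and input string.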

var input_content = "blah \
    <a href=\"http://yahoo.com\">Yahoo</a> \
    blah \
    <a href=\"http://google.com\">Google</a> \
    blah";

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

alert(matches.join("\n"));

This gives:


<a href="http://yahoo.com">Yahoo</a>,http://yahoo.com,Yahoo
<a href="http://google.com">Google</a>,http://google.com,Google
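
In environments that support String.prototype.matchAll (ES2020), the same array can be built without using replace for its side effects. This is an alternative sketch, not the original answer's method; the names anchorPattern and modernMatches are assumptions. Because each match object already exposes the whole match at index 0, the leading [^<]* and the outer capture group are dropped here:

```javascript
var input_content = 'blah <a href="http://yahoo.com">Yahoo</a> blah ' +
    '<a href="http://google.com">Google</a> blah';

// Group 1 is the URL, group 2 is the anchor text; m[0] is the whole anchor.
var anchorPattern = /<a href="([^"]+)">([^<]+)<\/a>/g;

// matchAll returns an iterator of match objects; Array.from maps each one
// to a plain [whole anchor, url, text] array.
var modernMatches = Array.from(input_content.matchAll(anchorPattern), function (m) {
    return [m[0], m[1], m[2]];
});

console.log(modernMatches.join("\n"));
```

The same limitation applies as before: the pattern only matches anchors of the exact form <a href="...">...</a>.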

08-14 21:55