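ruby - How to get email addresses from HTML code with Nokogiri

How can I get email addresses from HTML code with Nokogiri? I'm thinking of using a regular expression, but I don't know if that's the best solution.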

Example code:

<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:example@example.com">Mail to me</a>
</body>
</html>

Is there a way in Nokogiri to get an email address if it is not between some tags?

Best answer

You can use XPath to extract the email addresses.

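The selector //a selects every a tag on the page, and you can specify an attribute with the @ syntax, so //a/@href gives you the href attribute of every a tag on the page.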

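If the page mixes a tags with different URL types (for example http:// URLs), you can use XPath functions to narrow down the selected nodes further. The selector

//a[starts-with(@href, "mailto:")]/@href

gives you the href attribute nodes of all a tags whose href attribute starts with "mailto:".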

Putting it all together, with a little extra code to strip the "mailto:" from the start of each attribute value:

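require 'nokogiri'

# Select the href attribute of every <a> whose href starts with "mailto:".
selector = '//a[starts-with(@href, "mailto:")]/@href'

doc = Nokogiri::HTML.parse(File.read('my_file.html'))

nodes = doc.xpath(selector)

# Drop the leading "mailto:" (7 characters) from each attribute value.
addresses = nodes.collect { |n| n.value[7..-1] }

puts addresses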

With a test file like this:

<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:example@example.com">Mail to me</a>
<a href="http://example.com">A Web link</a>
<a>An empty anchor.</a>
</body>
</html>

this code outputs the desired example@example.com. addresses is an array of all the email addresses found in mailto links in the document.
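Slicing off the first seven characters works for plain links, but a mailto link may also carry a query part (such as ?subject=...) or several comma-separated recipients. The following is a minimal sketch, not part of the original answer, that uses Ruby's standard uri library to handle those cases. The helper name mailto_addresses is hypothetical, and the sample href values are assumptions.

```ruby
require 'uri'

# Hypothetical helper: returns the addresses found in one href value,
# or an empty array if the value is not a usable mailto: URI.
def mailto_addresses(href)
  uri = URI.parse(href)
  return [] unless uri.is_a?(URI::MailTo)

  # URI::MailTo#to holds the recipient part without the "?subject=..." headers;
  # several recipients are separated by commas.
  uri.to.to_s.split(',')
rescue URI::Error
  # Malformed href values are skipped instead of raising.
  []
end

p mailto_addresses('mailto:example@example.com')
p mailto_addresses('mailto:example@example.com?subject=Hello')
p mailto_addresses('mailto:a@example.com,b@example.com')
p mailto_addresses('http://example.com')
```

With Nokogiri, each attribute node's value could be passed through such a helper, for example with flat_map over the selected nodes, to collect every address into one array.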

For "ruby - How to get email addresses from HTML code with Nokogiri", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9492259/