Problem description
Suppose the content of an HTML page is:
<a href="abc.com"><b>ABC</b>industry</a>
<a href="google.com">ABC Search</a>
<a href="abc.com">Movies with<b>ABC</b></a>
I want to extract only the links that contain bold text. How can I do this using WWW::Mechanize?
Expected output:
ABC industry
Movies with ABC
I tried:
my @arr = $m->links();
foreach my $link (@arr) { print $link->text, "\n"; }
but this prints the text of every link on the page.
Without an extra module to parse the contents of the page, it's going to be difficult to achieve your goal with WWW::Mechanize alone. However, there are other modules that will allow you to achieve this very easily.
Here is an example using Mojo::DOM, which lets you select elements as you would in CSS. The Mojolicious distribution also contains Mojo::UserAgent, so you could migrate your code over to Mojo fairly easily if you are not too tied to WWW::Mechanize.
use feature 'say';
use Mojo::DOM;

# $html is the content of the page
my $dom = Mojo::DOM->new($html);

# extract all <b> elements that are under <a> elements (at any depth beneath the <a>)
# and get the <a> ancestors of those elements
# creates a Mojo::Collection object
my $collection = $dom->find('a b')->map(sub { return $_->ancestors('a') })->flatten;
$collection->each(sub {
    say "LINK: " . $_->all_text;
});

# Use a sub to perform an action on each of the retrieved <a> elements:
$dom->find('a b')->each(sub {
    $_->ancestors('a')->each(sub {
        say "All in one: " . $_->all_text;
    });
});
Here's a demonstration with a sample list of links:
<html>
<ul><li><a href="abc.com"><b>ABC</b> industry</a></li>
<li><a href="google.com">ABC Search</a></li>
<li>Here is <a href="#">a link
<span>with a span
<b>and a "b" tag</b>
even though
</span> "b" tags are deprecated.</a> Yay!</li>
<li><a href="abc.com">Movies with <b>ABC</b></a></li></ul></html>
Output:
LINK: ABC industry
LINK: a link with a span and a "b" tag even though "b" tags are deprecated.
LINK: Movies with ABC
All in one: ABC industry
All in one: a link with a span and a "b" tag even though "b" tags are deprecated.
All in one: Movies with ABC
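If you would rather keep WWW::Mechanize for fetching, one option is to hand the page body to Mojo::DOM yourself. The following is only a sketch under a few assumptions: `bold_links` is a hypothetical helper name, the sample HTML is made up, and whitespace is normalized by hand because the exact spacing returned by `all_text` may differ between Mojolicious versions. It starts from the `a` elements and keeps those with a `b` descendant, so a link containing several bold runs is reported once rather than once per `<b>`.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper (not part of WWW::Mechanize or Mojolicious):
# returns one [href, text] pair per <a> element that contains a <b>.
sub bold_links {
    my ($html) = @_;
    return Mojo::DOM->new($html)
        ->find('a')
        ->grep(sub { defined $_->at('b') })    # keep links with a <b> descendant
        ->map(sub {
            my $text = $_->all_text;
            $text =~ s/\s+/ /g;                # collapse runs of whitespace
            $text =~ s/^ | $//g;               # trim both ends
            [ $_->attr('href'), $text ];
        })
        ->each;
}

# With WWW::Mechanize this string would come from $m->content after $m->get($url).
my $html = <<'HTML';
<a href="abc.com"><b>ABC</b> industry</a>
<a href="google.com">ABC Search</a>
<a href="abc.com">Movies <b>with</b> <b>ABC</b></a>
HTML

my @links = bold_links($html);
say "$_->[0]: $_->[1]" for @links;
```

Starting from the links rather than from the `<b>` elements is a design choice: it also keeps the `href` within easy reach, which is usually what you want next.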
If you use Mojo::UserAgent instead of WWW::Mechanize, your search can be even easier. Mojo::UserAgent can get a page (just like WWW::Mechanize), and the DOM of the returned page can be accessed using $ua->get($url)->res->dom. You can then chain the query above onto this, to give the following:
use feature 'say';
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new();

# get the page and find the links with a <b> element in them:
$ua->get('http://my-url-here.com')
    ->res->dom('a b')->each(sub { $_->ancestors('a')->each(sub { say $_->all_text }) });

# example using this page:
# print the contents of divs with class 'spacer' that contain a link with a div in it:
$ua->get('http://stackoverflow.com/questions/26353298/find-links-containing-bold-text-using-wwwmechanize')
    ->res->dom('a div')->each(sub {
        $_->ancestors('div.spacer')->each(sub {
            say $_->all_text;
        });
    });
Output:
1 How to use WWW::Mechanize to submit a form which isn't there in HTML?
0 How to process a simple loop in Perl's WWW::Mechanize?
0 Perl WWW::Mechanize cookie problem
1 Getting error in accessing a link using WWW::Mechanize
0 How to use output from WWW::Mechanize?
-2 Use WWW::Mechanize to login in webpage without form login but javascript using perl
3 Perl WWW::Mechanize Web Spider. How to find all links
0 Howto use WWW::Mechanize to access pages split by drop-down list
0 What is the best way to extract unique URLs and related link text via perl mechanize?
0 Perl WWW::Mechanize doesn't print results when reading input data from a data file
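To try the Mojo::UserAgent chain without touching the network, one possibility is to point the user agent at a small in-process application. This is a rough sketch, not the answer's original code: the route and the page it serves are stand-ins invented for the example, and it assumes a reasonably recent Mojolicious that provides `result` on the transaction.

```perl
use strict;
use warnings;
use feature 'say';
use Mojolicious::Lite;
use Mojo::UserAgent;

# Stand-in page served by an in-process application (an assumption made
# so that the example needs no network access).
get '/' => sub {
    my $c = shift;
    $c->render(text => '<a href="abc.com"><b>ABC</b> industry</a>'
                     . '<a href="google.com">ABC Search</a>');
};
app->log->level('fatal');    # keep request logging out of the output

my $ua = Mojo::UserAgent->new;
$ua->server->app(app);       # relative URLs are now answered by the app above

# Fetch the page, then keep the text of every link that contains a <b>.
my @texts = $ua->get('/')->result->dom
    ->find('a')
    ->grep(sub { defined $_->at('b') })
    ->map('all_text')
    ->each;

say for @texts;
```

Against a real site you would drop the `Mojolicious::Lite` part and pass an absolute URL to `$ua->get`.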
There are lots of examples in the Mojolicious documentation in case this isn't immediately comprehensible!
For a helpful 8-minute introductory video on Mojo::DOM and Mojo::UserAgent, check out Mojocast Episode 5.