I have the following code:

use WWW::Mechanize;
$url = "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E";
$mech = WWW::Mechanize->new();
$mech->get($url);
$content = $mech->content();
while ($content =~ m/<META HTTP-EQUIV="refresh" CONTENT="(\d+); URL=(.+?)">/) {
    $refresh = $1;
    $link = $2;
    sleep $refresh;
    $mech->get($link);
    $content = $mech->content();
}
$mech->save_content("output.txt");

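When I put the URL assigned to $url into a browser, the end result is a PDF download, but when I run the code above I end up with a different file. I think Mechanize may not be handling cookies correctly. How can I make it work?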

Best answer

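When you request http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E, you first get a redirect to https.
You then get a page with a META REFRESH, which points you to a file under /TMP.
After fetching https://daccess-ods.un.org/TMP/xxx.xxx.html and following its META REFRESH to https://documents-dds-ny.un.org/doc/UNDOC/GEN/G15/263/87/PDF/G1526387.pdf?OpenElement, the document still does not download; you get an error message instead.
The reason, found by checking the headers sent by the browser, is that the browser sets three cookies while WWW::Mechanize sets only one: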
citrix_ns_id=xxx
citrix_ns_id_.un.org_%2F_wat=xxx
LtpaToken=xxx
So where do these cookies come from? It turns out that the TMP HTML page contains more than just a META REFRESH. It also contains this HTML:

<frameset ROWS="0,100%" framespacing="0" FrameBorder="0" Border="0">
  <frame name="footer" scrolling="no" noresize target="main" src="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234" marginwidth="0" marginheight="0">
  <frame name="main" src="" scrolling="auto" target="_top">
  <noframes>
  <body>
  <p>This page uses frames, but your browser doesn't support them.</p>
  </body>
  </noframes>
</frameset>

The frame URL https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234 is what sets these cookies:
Set-Cookie: LtpaToken=xxx; domain=.un.org; path=/
Set-Cookie: citrix_ns_id=xxx; Domain=.un.org; Path=/; HttpOnly
Set-Cookie: citrix_ns_id_.un.org_%2F_wat=xxx; Domain=.un.org; Path=/
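
To see how headers like these end up in a cookie jar, here is a minimal offline sketch. It assumes the HTTP::Cookies, HTTP::Request and HTTP::Response modules from libwww-perl, which WWW::Mechanize already depends on. The helper name cookie_names_from_headers is hypothetical, and the response is built by hand rather than fetched from the site.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

# Hypothetical helper: feed Set-Cookie header values into a fresh jar, as if
# they had arrived in a response to a GET of $url, and return the names of
# the cookies the jar accepted.
sub cookie_names_from_headers {
    my ($url, @set_cookie) = @_;

    # A hand-built response; extract_cookies needs the originating request
    # to check the cookie domain and path against the request URL.
    my $response = HTTP::Response->new(200, 'OK');
    $response->request(HTTP::Request->new(GET => $url));
    $response->push_header('Set-Cookie' => $_) for @set_cookie;

    my $jar = HTTP::Cookies->new;
    $jar->extract_cookies($response);

    my @names;
    $jar->scan(sub { push @names, $_[1] });    # $_[1] is the cookie name
    my @sorted = sort @names;
    return @sorted;
}

my @names = cookie_names_from_headers(
    'https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234',
    'LtpaToken=xxx; domain=.un.org; path=/',
    'citrix_ns_id=xxx; Domain=.un.org; Path=/; HttpOnly',
    'citrix_ns_id_.un.org_%2F_wat=xxx; Domain=.un.org; Path=/',
);
print "$_\n" for @names;
```

With a live WWW::Mechanize object, printing $mech->cookie_jar->as_string after each request is a quick way to compare the jar against what the browser sends.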

So, changing the code to take this into account:
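use strict;
use warnings;
use WWW::Mechanize;

my $url = "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E";
my $mech = WWW::Mechanize->new();
$mech->get($url);

# Keep going as long as the current page contains a META REFRESH.
my $more = 1;
while ($more) {
    $more = 0;
    my $follow_link;
    my @links = $mech->links;
    foreach my $link (@links) {
        # Remember the META REFRESH target; it is followed after the frames.
        if ($link->tag eq 'meta') {
            $follow_link = $link;
        }
        # Fetch every frame with a non-empty src so that its cookies land in
        # the cookie jar, then go back to the page that contained the frame.
        if (($link->tag eq 'frame') && ($link->url)) {
            $mech->follow_link( url => $link->url );
            $mech->back;
        }
    }
    if ($follow_link) {
        $more = 1;
        $mech->follow_link( url => $follow_link->url );
    }
}
$mech->save_content("output.txt");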

output.txt now contains the PDF:
$ file output.txt
output.txt: PDF document, version 1.5
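
The same check can be made from Perl by looking for the "%PDF-" signature at the start of the file. This is a minimal sketch using only core Perl; the helper names looks_like_pdf and file_is_pdf are hypothetical, and the check only inspects the first five bytes rather than validating the document.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the given bytes begin with the PDF signature.
sub looks_like_pdf {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    return substr($bytes, 0, 5) eq '%PDF-' ? 1 : 0;
}

# Hypothetical helper: read the first five bytes of a file in binary mode
# and test them. A file that cannot be opened counts as "not a PDF".
sub file_is_pdf {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return 0;
    my $head = '';
    read($fh, $head, 5);
    close $fh;
    return looks_like_pdf($head);
}

print file_is_pdf('output.txt') ? "PDF\n" : "not a PDF\n";
```

If the script reports "not a PDF", opening output.txt in a text editor usually shows the HTML error page that the server returned instead.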

Regarding "html - Mechanize does not handle cookies the way a browser does", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35935911/