HtmlAgilityPack HtmlWeb.Load returns an empty Document

Problem description

I have been using HtmlAgilityPack in a web crawler application for the last two months with no issues loading web pages.

Now, when I try to load this particular page, the document's OuterHtml is empty, so this test fails:

var url = "http://www.prettygreen.com/";
var htmlWeb = new HtmlWeb();
var htmlDoc = htmlWeb.Load(url);
var outerHtml = htmlDoc.DocumentNode.OuterHtml;
Assert.AreNotEqual("", outerHtml);

I can load other pages from the same site with no problems, for example by setting:

url = "http://www.prettygreen.com/news/";

I have had encoding problems in the past, so I tried htmlWeb.OverrideEncoding and htmlWeb.AutoDetectEncoding, with no luck. I have no idea what the issue with this page could be.

Recommended answer

It seems this website requires cookies to be enabled, so creating a cookie container for your web request should solve the issue:

var url = "http://www.prettygreen.com/";
var htmlWeb = new HtmlWeb();
htmlWeb.PreRequest += request =>
    {
        request.CookieContainer = new System.Net.CookieContainer();
        return true;
    };
var htmlDoc = htmlWeb.Load(url);
var outerHtml = htmlDoc.DocumentNode.OuterHtml;
Assert.AreNotEqual("", outerHtml);
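
If the crawler loads several pages from the same site, it may help to reuse one cookie container rather than creating a new one for every request. The following is only a sketch under assumptions: the variable names `cookies`, `home`, and `news` are made up for illustration, it uses C# 9 top-level statements, and whether this particular site needs cookies carried across requests is not confirmed by the answer above.

```csharp
using System.Net;
using HtmlAgilityPack;

// Assumption: a single container shared by every request made through this
// HtmlWeb instance, so cookies set by one response are sent with later requests.
var cookies = new CookieContainer();

var htmlWeb = new HtmlWeb();
htmlWeb.PreRequest += request =>
{
    // Attach the shared container instead of allocating a new one each time.
    request.CookieContainer = cookies;
    return true; // same return value as in the answer's handler
};

// Both URLs come from the question; the second load reuses any cookies
// the first response stored in the container.
var home = htmlWeb.Load("http://www.prettygreen.com/");
var news = htmlWeb.Load("http://www.prettygreen.com/news/");
```

The only difference from the snippet above is that the container is created once, outside the handler, so it keeps its contents between calls to Load.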

