Problem description
I'm trying to download an HTML document from Amazon, but for some reason I get a badly encoded string like "��K��g��g�e".
Here is the code I tried:
using (var webClient = new System.Net.WebClient())
{
    var url = "https://www.amazon.com/dp/B07H256MBK/";
    webClient.Encoding = Encoding.UTF8;
    var result = webClient.DownloadString(url);
}
The same thing happens when using HttpClient:
var url = "https://www.amazon.com/dp/B07H256MBK/";
var httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(url);
I also tried reading the result as bytes and then converting it to UTF-8, but I still get the same result. Also note that this does NOT always happen. For example, yesterday I ran this code for ~2 hours and got a correctly encoded HTML document. Today, however, I always get a badly encoded result. It happens every other day, so it's not a one-time thing.
==================================================================
However, when I use HtmlAgilityPack's wrapper, it works as expected every time:
var url = "https://www.amazon.com/dp/B07H256MBK/";
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(url);
What causes WebClient and HttpClient to return a badly encoded string even when I explicitly set the correct encoding? And how does HtmlAgilityPack's wrapper get it right by default?
Thanks for your help!
Recommended answer
I fired up Firefox's web dev tools, requested that page, and looked at the response headers:
See that content-encoding: gzip? That means the response is gzip-encoded.
It turns out that Amazon gives you a response compressed with gzip even when you don't send an Accept-Encoding: gzip header (verified with another tool). This is a bit naughty, but not that uncommon, and easy to work around.
This wasn't a problem with character encodings at all. HttpClient is good at figuring out the correct encoding from the Content-Type header.
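To illustrate why setting Encoding.UTF8 or re-decoding the bytes cannot help, here is a small offline sketch. It compresses a snippet of HTML the way a server sending Content-Encoding: gzip would, then shows that the body only becomes readable after decompression. The names GzipDemo, Compress, LooksGzipped and Decode are hypothetical helpers made up for this example; checking for the gzip magic bytes 0x1F 0x8B is one possible way to detect a compressed body, not something the clients in the question do.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDemo
{
    // Simulates a server that gzip-compresses a UTF-8 HTML body.
    public static byte[] Compress(string text)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the GZipStream has closed the MemoryStream.
            return output.ToArray();
        }
    }

    // Every gzip stream starts with the two magic bytes 0x1F 0x8B.
    public static bool LooksGzipped(byte[] body)
    {
        return body.Length >= 2 && body[0] == 0x1F && body[1] == 0x8B;
    }

    // Decompresses first if the body looks gzipped, then decodes as UTF-8.
    public static string Decode(byte[] body)
    {
        if (!LooksGzipped(body))
        {
            return Encoding.UTF8.GetString(body);
        }

        using (var input = new MemoryStream(body))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] body = Compress("<html><body>Hello</body></html>");

        // Decoding the compressed bytes directly yields unreadable text,
        // no matter which character encoding is chosen.
        Console.WriteLine(Encoding.UTF8.GetString(body));

        // Decompressing first recovers the original markup.
        Console.WriteLine(Decode(body)); // prints <html><body>Hello</body></html>
    }
}
```

The first line printed is binary noise rendered as text, which is the same kind of output the question shows.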
You can tell HttpClient to un-zip responses with:
HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.GZip,
};
using (var client = new HttpClient(handler))
{
    // your code
}
This will be set automatically if you're using the NuGet package versions 4.1.0 to 4.3.2; otherwise you'll need to do it yourself.
You can do the same with WebClient, but it's harder.
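One common way to do this with WebClient, sketched here as an assumption rather than as the answer's own code, is to subclass it and enable decompression on the underlying HttpWebRequest. The class name DecompressingWebClient is hypothetical; GetWebRequest, HttpWebRequest.AutomaticDecompression and DecompressionMethods are the framework's own members.

```csharp
using System;
using System.Net;

// Hypothetical helper: a WebClient that asks the framework to
// decompress gzip/deflate response bodies transparently.
public class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests expose AutomaticDecompression;
        // other schemes (e.g. file://) are returned unchanged.
        if (request is HttpWebRequest httpRequest)
        {
            httpRequest.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }

        return request;
    }
}
```

It is then used exactly like the WebClient in the question: create a DecompressingWebClient in a using block, set Encoding to Encoding.UTF8, and call DownloadString(url). Including Deflate alongside GZip is an extra choice made in this sketch; the answer itself only sets GZip.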