Node.js Cheerio parser breaks UTF-8 encoding

Question

I parse my request with Cheerio like this:

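const request = require('request');
const cheerio = require('cheerio');

// The URL must be a quoted string literal.
const url = "http://shop.nag.ru/catalog/16939.IP-videonablyudenie-OMNY/16944.IP-kamery-OMNY-c-vario-obektivom/16704.OMNY-1000-PRO";
request.get(url, function (err, response, body) {
  if (err) throw err;
  console.log(body);
  const $ = cheerio.load(body);
  console.log($(".description").html());
});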

As output I see the content, but in a strange, unreadable encoding:

//Plain body console.log(body) (p.s. russian chars): 
<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1><p style

//  cheerio's console.log $(".description").html()
<h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY

The target URL is served as UTF-8. So why does Cheerio break my encoding?

I tried using iconv to decode the response body:

var body1 = iconv.decode(body, "utf-8");

console.log($(".description").html()); still returns the strange text.

Recommended answer

Cheerio hasn't broken anything. It is outputting HTML entities, which any browser renders exactly the same as the original HTML input. Render these two snippets to see what I mean:

<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>

<h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY - &#x43F;&#x43E;&#x43F;&#x440;&#x43E;&#x431;&#x443;&#x439;&#x442;&#x435; &#x43D;&#x430;&#x439;&#x442;&#x438; &#x43B;&#x443;&#x447;&#x448;&#x435;</span></h1>

&#x423;, for example, is the character У encoded as an HTML entity, in the same way that the entity &gt; represents >.
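To make the equivalence concrete, here is a minimal sketch of the round trip between characters and hexadecimal numeric entities. The helpers encodeNonAscii and decodeHexEntities are hypothetical names written for this illustration; they are not part of Cheerio's API and only approximate what an HTML serializer does.

```javascript
// Hypothetical helpers for illustration; they are not part of Cheerio's API.

// Replace every non-ASCII character with a hexadecimal numeric entity.
function encodeNonAscii(text) {
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return codePoint > 0x7f ? `&#x${codePoint.toString(16).toUpperCase()};` : ch;
  }).join('');
}

// Turn hexadecimal numeric entities such as &#x423; back into characters.
// Named entities such as &gt; are left untouched by this sketch.
function decodeHexEntities(html) {
  return html.replace(/&#x([0-9a-f]+);/gi, (match, hex) =>
    String.fromCodePoint(parseInt(hex, 16))
  );
}

const original = 'Уличная 3Мп IP HD камера OMNY';
const encoded = encodeNonAscii(original);
const decoded = decodeHexEntities(encoded);

console.log(encoded);
console.log(decoded === original); // true
```

The encoded string carries exactly the same characters as the original, only written in a different notation, which is why a browser displays both forms identically.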

但是,如果要获取未编码的文本,可以设置 decodeEntities false 的选项:

However, if you want to get the unencoded text, you can set the decodeEntities option to false:

const cheerio = require('cheerio');

// With decodeEntities set to false, non-ASCII characters are kept as-is.
const $ = cheerio.load(
  `<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>`,
  { decodeEntities: false }
);

console.log($('span').html())
// => Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше


09-26 23:39