Problem description
I need to scrape some content from Google search results that only shows up in browsers (I suspect only when JavaScript is enabled), specifically the "People also search for" content in their Knowledge Graph.
I use a combination of request and cheerio to scrape, and I have already managed to force-load results from the .com domain. However, the Knowledge Graph box does not show up in the body of my results, probably because it is JavaScript-generated content.
Does anybody know if there's a setting I could add or another library I could use?
Here's my code below. Thank you!
var request = require('request');
var cheerio = require("cheerio");

// Keep cookies between requests so the /ncr (no country redirect) preference sticks.
request = request.defaults({jar: true});

var options = {
    url: 'http://www.google.com/ncr',
    headers: {
        'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16'
    }
};

// First request sets the .com cookie; the second one runs the actual search.
request(options, function () {
    request('https://www.google.com/search?gws_rd=ssl&site=&source=hp&q=google&oq=google', function (error, response, body) {
        // body is the static HTML only; no page scripts have been executed.
        var $ = cheerio.load(body);
        $("li").each(function() {
            var link = $(this);
            var text = link.text();
            console.log(text);
        });
    });
});
Recommended answer
You can't do this with Node's request module, because it merely downloads the static content and never executes the page's scripts. To render JavaScript you have to use a browser. Fortunately, there are headless browsers built for exactly this purpose. I suggest PhantomJS.
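As a rough illustration of the headless-browser approach, here is a minimal sketch. It swaps in Puppeteer (headless Chrome) instead of PhantomJS, since PhantomJS development was suspended in 2018; the idea is the same. The helper names buildSearchUrl and scrapeRenderedText are hypothetical, the 'li' selector is simply carried over from the question, and the 'networkidle2' wait condition is an assumption about when the page has finished rendering. It assumes `npm install puppeteer` has been run.

```javascript
// Hypothetical helper: builds a Google search URL for a query string.
function buildSearchUrl(query) {
    return 'https://www.google.com/search?q=' + encodeURIComponent(query);
}

// Hypothetical helper: loads a URL in headless Chrome, lets the page's
// JavaScript run, then returns the text of every element matching `selector`.
async function scrapeRenderedText(url, selector) {
    // Required lazily so the pure helper above works without Puppeteer installed.
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        // Assumption: waiting until the network is mostly idle is enough
        // for the script-generated content to be present in the DOM.
        await page.goto(url, { waitUntil: 'networkidle2' });
        // Runs inside the browser page, against the rendered DOM.
        return await page.$$eval(selector, function (elements) {
            return elements.map(function (el) {
                return el.textContent.trim();
            });
        });
    } finally {
        // Always shut the browser down, even if navigation fails.
        await browser.close();
    }
}
```

Calling `scrapeRenderedText(buildSearchUrl('google'), 'li').then(console.log)` would print the text of each list item from the rendered page rather than from the raw HTML. Whether the "People also search for" box is among them still depends on what Google serves to that browser session, and the selector would likely need adjusting to the actual markup.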