Puppeteer: Grabbing the entire HTML from a page that uses lazy loading


Question

I am trying to grab the entire HTML of a web page that uses lazy loading. I have tried scrolling all the way to the bottom and then calling page.content(). I have also tried scrolling back to the top of the page after reaching the bottom and then calling page.content(). Both approaches grab some rows of the table, but not all of them, and getting every row is my main goal. I believe the page uses lazy loading from react.js.

const puppeteer = require('puppeteer');
const fs = require('fs');
const url = 'https://www.torontopearson.com/en/departures';

// page.waitFor() is deprecated and missing from recent Puppeteer releases,
// so pause with a plain timer instead.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    await page.goto(url);
    await sleep(300);

    // scroll to the bottom so that lazily loaded rows are requested
    await autoScroll(page);
    await sleep(2500);

    // scroll back up to near the top of the page
    await page.evaluate(() => window.scrollTo(0, 50));

    const html = await page.content();

    // the promise-based API lets await actually wait for the write to finish
    await fs.promises.writeFile('scrape.html', html);
    console.log("Successfully Written to File.");
    await browser.close();
});

//method used to scroll to bottom, referenced from user visualxcode on https://github.com/GoogleChrome/puppeteer/issues/305
async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            var totalHeight = 0;
            var distance = 300;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                if(totalHeight >= scrollHeight){
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}

Recommended answer

The problem is that the linked page uses the library react-virtualized. This library renders only the part of the table that is currently visible, and removes rows from the DOM again once they scroll out of view. Therefore you cannot get the whole table at once: scrolling to the bottom of the table only puts the bottom part of the table in the DOM.
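If the rows do have to come from the DOM, one way to work with a virtualized table is to read the rendered rows after every scroll step and merge them, instead of calling page.content() once at the end. The following is only a minimal sketch under several assumptions: `ROW_SELECTOR` is a guess based on the class names react-virtualized usually emits and must be checked against the real page, the table is assumed to scroll with the window (as in the question's autoScroll), and rows are deduplicated by their text, which would merge two rows with identical text. `collectAllRows` and its options are hypothetical names introduced for illustration.

```javascript
// Assumption: rows carry react-virtualized's usual class name; verify in DevTools.
const ROW_SELECTOR = '.ReactVirtualized__Table__row';

// Scrolls the window step by step and records the text of every row that is
// rendered along the way. `page` is a Puppeteer Page object.
async function collectAllRows(page, { step = 300, pauseMs = 200, maxSteps = 500 } = {}) {
    const seen = new Set(); // a Set keeps insertion order and drops repeats

    for (let i = 0; i < maxSteps; i++) {
        // Read whatever rows are in the DOM right now.
        const batch = await page.$$eval(ROW_SELECTOR, rows =>
            rows.map(row => row.innerText.trim())
        );
        batch.forEach(text => seen.add(text));

        // Scroll one step; if the position did not change, we are at the bottom.
        const atBottom = await page.evaluate(distance => {
            const before = window.scrollY;
            window.scrollBy(0, distance);
            return window.scrollY === before;
        }, step);
        if (atBottom) break;

        // Give the library a moment to render the newly visible rows.
        await new Promise(resolve => setTimeout(resolve, pauseMs));
    }
    return [...seen];
}
```

In the question's script this could be called as `const rows = await collectAllRows(page);` in place of autoScroll followed by page.content(). The step size should stay smaller than the visible height of the table so that no row is skipped between two reads.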

To check where the page loads its content from, you should check the network tab of the DevTools. You will notice that the content of the page is loaded from this URL, which seems to provide a perfect representation of the DOM in JSON format. So, there is really no need to scrape that data from the page. You can just use the URL.
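Requesting that URL directly could look roughly like the sketch below. The link itself did not survive in this copy of the answer, so `DATA_URL` is a made-up placeholder that has to be replaced with the request shown in the Network tab. The helper name `fetchTableData` is also hypothetical, and the code assumes Node.js 18 or later, where `fetch` is available globally.

```javascript
// Hypothetical placeholder: substitute the URL found in the DevTools Network tab.
const DATA_URL = 'https://example.com/path/to/data.json';

// Downloads and parses the JSON that the page itself loads.
// `fetchImpl` can be swapped out, for example in tests; it defaults to the global fetch.
async function fetchTableData(url, fetchImpl = globalThis.fetch) {
    const response = await fetchImpl(url);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.json();
}

// Example usage (not executed here):
// fetchTableData(DATA_URL).then(data => console.log(data));
```

Compared with driving a browser, this avoids the virtualization problem entirely, because the JSON response is not limited to the rows that happen to be rendered.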
