Puppeteer: How to download an entire web page for offline use

Question

How can I scrape an entire website with Google's Puppeteer, keeping all of its CSS, JavaScript, and media intact rather than just its HTML? Having used it successfully for other scraping jobs, I would expect it to be capable of this.

However, looking through the many excellent examples online, I cannot find an obvious way to do so. The closest I have been able to find is calling

html_contents = await page.content()

and saving the result, but that saves only the HTML, without any of the non-HTML resources.

Is there a way to save web pages for offline use with Puppeteer?

Recommended answer

This is currently possible through the experimental CDP (Chrome DevTools Protocol) call 'Page.captureSnapshot', which saves the page in MHTML format:

'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw Chrome DevTools Protocol session for this page.
    const cdp = await page.target().createCDPSession();
    // Page.captureSnapshot returns the snapshot as a single MHTML string.
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);
  } catch (err) {
    console.error(err);
  } finally {
    // Close the browser even if navigation or the snapshot fails.
    if (browser) await browser.close();
  }
})();
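
To check that a snapshot really contains the stylesheets, scripts, and images rather than just the HTML, one option is to list the resources recorded inside the saved file. The sketch below assumes the snapshot follows the usual MHTML layout, in which every embedded resource is a MIME part that carries its original URL in a Content-Location header on a single line. listSnapshotResources is a hypothetical helper name, not part of Puppeteer, and the sample string is hand-written for illustration rather than real Chrome output.

```javascript
'use strict';

// Hypothetical helper (not a Puppeteer API): returns the URLs found in the
// Content-Location header of each MIME part of an MHTML document.
// Assumption: each header sits on one line and is not folded.
function listSnapshotResources(mhtml) {
  const urls = [];
  // Anchored at line start, so "Snapshot-Content-Location" is not matched.
  const headerPattern = /^Content-Location:[ \t]*(.+)$/gim;
  let match;
  while ((match = headerPattern.exec(mhtml)) !== null) {
    urls.push(match[1].trim());
  }
  return urls;
}

// Tiny hand-written MHTML-like sample with two parts (HTML and CSS).
const sample = [
  'From: <Saved by Blink>',
  'Snapshot-Content-Location: https://example.com/',
  'MIME-Version: 1.0',
  'Content-Type: multipart/related; type="text/html"; boundary="----b----"',
  '',
  '------b----',
  'Content-Type: text/html',
  'Content-Location: https://example.com/',
  '',
  '<html></html>',
  '------b----',
  'Content-Type: text/css',
  'Content-Location: https://example.com/style.css',
  '',
  'body { margin: 0; }',
  '------b------',
  '',
].join('\r\n');

console.log(listSnapshotResources(sample));
```

With a real snapshot, the same helper could be called as listSnapshotResources(fs.readFileSync('page.mhtml', 'utf8')). If only the page URL comes back, the snapshot was probably taken before the other resources finished loading.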

