Puppeteer: how to get the current page (application/pdf) as a buffer or file?


Problem description

Using Puppeteer (https://github.com/GoogleChrome/puppeteer), I have a page whose content type is application/pdf. With headless: false, the page is loaded through the Chromium PDF viewer, but I want to run headless. How can I download the original .pdf file, or use it as a blob with another library such as pdf-parse (https://www.npmjs.com/package/pdf-parse)?

Recommended answer

Since Puppeteer does not currently support navigating to a PDF document in headless mode via page.goto() due to an upstream issue, you can call page.setRequestInterception() to enable request interception. You can then listen for the 'request' event, detect whether the resource is a PDF, and use a separate request client to fetch the PDF buffer.

After obtaining the PDF buffer, call request.abort() to abort the original Puppeteer request. If the request is not for a PDF, call request.continue() to let it proceed normally.
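The detection step can be done in several ways. A plain suffix test on the full URL misses addresses that carry a query string or fragment. As a minimal sketch, assuming a URL-based heuristic is good enough, the check could be made against the URL's pathname instead. The helper name `looksLikePdfUrl` is hypothetical and not part of Puppeteer.

```javascript
'use strict';

// Hypothetical helper (not a Puppeteer API): decide from the URL alone
// whether a request probably targets a PDF document.
function looksLikePdfUrl(rawUrl) {
  let pathname;
  try {
    // The WHATWG URL parser separates the path from ?query and #fragment.
    pathname = new URL(rawUrl).pathname;
  } catch (error) {
    return false; // Not an absolute URL; treat it as "not a PDF".
  }
  return pathname.toLowerCase().endsWith('.pdf');
}

console.log(looksLikePdfUrl('https://example.com/hello-world.pdf?download=1')); // true
console.log(looksLikePdfUrl('https://example.com/index.html')); // false
```

Inside a 'request' handler, `looksLikePdfUrl(request.url())` would be the condition that selects the PDF branch. This remains a heuristic: a server may serve a PDF from a path that has no .pdf suffix.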

Here is a complete example:

'use strict';

const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setRequestInterception(true);

  page.on('request', request => {
    if (request.url().endsWith('.pdf')) {
      request_client({
        uri: request.url(),
        encoding: null, // Return the body as a Buffer rather than a string
        headers: {
          'Content-type': 'application/pdf',
        },
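      }).then(response => {
        console.log(response); // PDF Buffer
        request.abort();
      }).catch(error => {
        // If the download fails, still abort so page.goto() does not hang.
        console.error(error);
        request.abort();
      });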
    } else {
      request.continue();
    }
  });

  // Aborting the intercepted request makes page.goto() reject, so that error is ignored here.
  await page.goto('https://example.com/hello-world.pdf').catch(error => {});

  await browser.close();
})();
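
The question also asks for the PDF as a file. As a rough sketch, assuming the buffer has already been obtained as above, it could be written to disk with Node's built-in fs module after a quick sanity check. The helper names `isPdfBuffer` and `savePdfBuffer`, the output path, and the placeholder buffer are all illustrative assumptions. The check relies on the fact that a well-formed PDF begins with the bytes "%PDF-".

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');

// A well-formed PDF file starts with the ASCII signature "%PDF-".
function isPdfBuffer(buffer) {
  return Buffer.isBuffer(buffer) &&
    buffer.length >= 5 &&
    buffer.toString('latin1', 0, 5) === '%PDF-';
}

// Hypothetical helper: write the buffer to disk only if it looks like a PDF.
async function savePdfBuffer(buffer, filePath) {
  if (!isPdfBuffer(buffer)) {
    throw new Error('Buffer does not start with the %PDF- signature');
  }
  await fs.promises.writeFile(filePath, buffer);
  return filePath;
}

// Demonstration with a placeholder buffer standing in for the downloaded PDF.
const demoBuffer = Buffer.from('%PDF-1.4\n% placeholder, not a complete PDF\n', 'latin1');
const demoPath = path.join(os.tmpdir(), 'hello-world.pdf');

savePdfBuffer(demoBuffer, demoPath)
  .then(savedPath => console.log('Saved to', savedPath))
  .catch(error => console.error(error.message));
```

In the example above, `savePdfBuffer(response, somePath)` could be called in place of `console.log(response)`. The signature check also guards against saving an HTML error page that a server returned instead of the PDF. The same buffer can be handed to a parsing library such as pdf-parse rather than written to disk.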
