Heroku Error R10 (Boot timeout) with Puppeteer on Node (web scraping app)

Question

I created a web scraping app, which checks for a certain problem on an ecommerce website.

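What it does:

  • loops through an array of pages
  • checks for a condition on every page
  • if the condition is met, pushes the page to a temporary array
  • sends an email with the temporary array as the body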

I wrapped that function in a cronjob function. On my local machine it runs fine.

Deployed like this:

  • headless: true
  • '--no-sandbox',
  • '--disable-setuid-sandbox'
  • added the pptr buildpack link to the settings in Heroku
  • slug size is 259.6 MiB of 500 MiB

It didn't work.

  • set the boot timeout to 120 s (instead of 60 s)

It worked. But only ran once.

Since I want to run that function several times per day, I need to fix the issue.

I have another app running which uses the same cronjob and notification function, and it works on Heroku.

Here's my code, if anyone is interested.

const puppeteer = require('puppeteer');
const nodemailer = require("nodemailer");
const CronJob = require('cron').CronJob;
let articleInfo ='';
const mailArr = [];
let body = '';

const testArr = [
    'https://bxxxx..', 'https://b.xxx..', 'https://b.xxxx..',
];

async function sendNotification() {

    let transporter = nodemailer.createTransport({
      host: 'mail.brxxxxx.dxx',
      port: 587,
      secure: false,
      auth: {
        user: '[email protected]',
        pass: process.env.heyBfPW2
      }
    });

    let textToSend = 'This is the heading';
    let htmlText = body;

    let info = await transporter.sendMail({
      from: '"BB Checker" <hey@baxxxxx>',
      to: "[email protected]",
      subject: 'Hi there',
      text: textToSend,
      html: htmlText
    });
    console.log("Message sent: %s", info.messageId);
  }

async function boxLookUp (item) {
    const browser = await puppeteer.launch({
        headless: true,
        args: [
            '--no-sandbox',
            '--disable-setuid-sandbox',
          ],
    });
    const page = await browser.newPage();
    await page.goto(item);
    const content = await page.$eval('.set-article-info', div => div.textContent);
    const title = await page.$eval('.product--title', div => div.textContent);
    const orderNumber = await page.$eval('.entry--content', div => div.textContent);

    // Check if deliveryTime is already updated
    try {
        await page.waitForSelector('.delivery--text-more-is-coming');
    // if not
      } catch (e) {
        if (e instanceof puppeteer.errors.TimeoutError) {
          // if not updated check if all parts of set are available
          if (content != '3 von 3 Artikeln ausgewählt' && content != '4 von 4 Artikeln ausgewählt' && content != '5 von 5 Artikeln ausgewählt'){
            articleInfo = `${title} ${orderNumber} ${item}`;
            mailArr.push(articleInfo)
            }
        }
      }
    await browser.close();
};

const checkBoxes = async (arr) => {

    for (const i of arr) {
        await boxLookUp(i);
    }

    console.log(mailArr);
    body = mailArr.toString();
    // wait for the mail to be sent so that send errors are not silently dropped
    await sendNotification();
};

async function startCron() {

    let job = new CronJob('0 */10 8-23 * * *', function() {  // run every 10 minutes from 08:00 through 23:50
        checkBoxes(testArr);
    }, null, true, null, null, true);
    job.start();
}

startCron();

Recommended answer

Assuming the rest of the code works (nodemailer, etc), I'll simplify the problem to focus purely on running a scheduled Node Puppeteer task in Heroku. You can re-add your mailing logic once you have a simple example running.

Heroku runs scheduled tasks using simple job scheduling or a custom clock process.

Simple job scheduling doesn't give you much control, but is easier to set up and potentially less expensive in terms of billable hours if you're running it infrequently. The custom clock, on the other hand, will be a continuously-running process and therefore chew up hours.

A custom clock process can do your cron task exactly, so that's probably the natural fit for this case.
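For reference, the schedule in the question is a six-field pattern as accepted by the cron npm package, where the first field is seconds. The sketch below only labels each field so the pattern is easier to read; describeCron is a hypothetical helper written for this illustration, not part of the cron package.

```javascript
// Field order of the six-field patterns accepted by the `cron` npm package.
const FIELDS = ["second", "minute", "hour", "dayOfMonth", "month", "dayOfWeek"];

// Hypothetical helper (not part of `cron`): pair each field with its name.
function describeCron(pattern) {
  const parts = pattern.trim().split(/\s+/);
  if (parts.length !== FIELDS.length) {
    throw new Error(`expected ${FIELDS.length} fields, got ${parts.length}`);
  }
  return Object.fromEntries(FIELDS.map((name, i) => [name, parts[i]]));
}

// The pattern from the question: second 0, every 10th minute, hours 8 to 23.
console.log(describeCron("0 */10 8-23 * * *"));
```

Read this way, the pattern fires at second 0 of every tenth minute during hours 8 through 23, every day.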

For certain scenarios, you can sometimes work around the simple scheduler's limits and build more complicated schedules by having the task exit early or by deploying multiple apps.

For example, if you want a twice-daily schedule, you could have two apps that run the same task scheduled at different hours of the day. Or, if you wanted to run a task twice weekly, schedule it to run daily using the simple scheduler, then have it check its own time and exit immediately if the current day isn't one of the two desired days.
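The "check its own time and exit immediately" idea could look roughly like the sketch below. The helper name shouldRunToday, the choice of Tuesday and Friday, and the use of UTC weekdays are all assumptions made for illustration; adjust them to the schedule you actually need.

```javascript
// Weekdays on which the task should do real work (0 = Sunday ... 6 = Saturday).
// Tuesday and Friday are arbitrary choices for this sketch.
const RUN_DAYS = new Set([2, 5]);

// Hypothetical helper: true when `date` falls on one of the allowed UTC weekdays.
function shouldRunToday(date = new Date(), runDays = RUN_DAYS) {
  return runDays.has(date.getUTCDay());
}

function main() {
  if (!shouldRunToday()) {
    console.log("not a scheduled day, exiting early");
    return;
  }
  console.log("scheduled day, running the task");
  // ... the actual scraping task would go here
}

main();
```

The scheduler still starts the script every day, but on the other five days it returns almost immediately.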

Regardless of whether you use a custom clock or simple scheduled task, note that long-running tasks really should be handled by a background task, so the examples below aren't production-ready. That's left as an exercise for the reader and isn't Puppeteer-specific.

Custom clock process

package.json:

{
  "name": "test-puppeteer",
  "version": "1.0.0",
  "description": "",
  "scripts": {
    "start": "echo 'running'"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "cron": "^1.8.2",
    "puppeteer": "^9.1.1"
  }
}

Procfile

clock:  node clock.js

clock.js:

const {CronJob} = require("cron");
const puppeteer = require("puppeteer");

// FIXME move to a worker task; see https://devcenter.heroku.com/articles/node-redis-workers
const scrape = async () => {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox", "--disable-setuid-sandbox"]
  });
  const [page] = await browser.pages();
  await page.setContent(`<p>clock running at ${Date()}</p>`);
  console.log(await page.content());
  await browser.close();
};

new CronJob({
  cronTime: "*/30 * * * * *", // run every 30 seconds for demonstration purposes
  onTick: scrape,
  start: true,
});
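One thing to keep in mind with the clock.js above: the cron package fires onTick on schedule whether or not the previous scrape has finished, so a slow page could lead to several browsers running at once. Short of moving the work to a background worker, a minimal guard might look like this sketch. skipIfRunning is a hypothetical wrapper, and the demo uses a stand-in task so it runs without Puppeteer.

```javascript
// Hypothetical wrapper: skip a tick while the previous run is still in progress.
// Resolves to true if the task ran, false if the tick was skipped.
function skipIfRunning(task) {
  let running = false;
  return async (...args) => {
    if (running) {
      console.log("previous run still in progress, skipping this tick");
      return false;
    }
    running = true;
    try {
      await task(...args);
      return true;
    } finally {
      running = false; // release the guard even if the task throws
    }
  };
}

// Demo with a stand-in task that takes 50 ms.
const slowTask = () => new Promise((resolve) => setTimeout(resolve, 50));
const guarded = skipIfRunning(slowTask);
guarded(); // starts running
guarded(); // skipped, because the first call has not finished yet
```

In clock.js this would be wired in as onTick: skipIfRunning(scrape).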

Set up

  1. Install the Heroku CLI and create a new app with Node and Puppeteer buildpacks (see this answer):

heroku create
heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack -a cryptic-dawn-48835
heroku buildpacks:add --index 1 heroku/nodejs -a cryptic-dawn-48835

(replace cryptic-dawn-48835 with your app name)

  2. Deploy:

git init
git add .
git commit -m "initial commit"
heroku git:remote -a cryptic-dawn-48835
git push heroku master

  3. Add a clock process:

    heroku ps:scale clock=1
    

  4. Verify that it's running with heroku logs --tail. Running heroku ps:scale clock=0 turns off the clock.


    Simple scheduler

    package.json:

    Same as above, but no need for cron. No need for a Procfile either.

    task.js:

    const puppeteer = require("puppeteer");
    
    (async () => {
      const browser = await puppeteer.launch({
        args: ["--no-sandbox", "--disable-setuid-sandbox"]
      });
      const [page] = await browser.pages();
      await page.setContent(`<p>scheduled job running at ${Date()}</p>`);
      console.log(await page.content());
      await browser.close();
    })();
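Both this task and the boxLookUp function in the question close the browser only on the success path, so an exception (for example a selector that is not found) would leave Chromium running. One way to guard against that is a try/finally wrapper, sketched below. withBrowser and fakeLaunch are hypothetical names; the demo uses a fake browser object so the sketch runs without Puppeteer installed.

```javascript
// Hypothetical helper: run `fn` with a browser and always close it afterwards.
// `launch` is any function returning a browser-like object with close(), e.g.
// () => puppeteer.launch({args: ["--no-sandbox", "--disable-setuid-sandbox"]}).
async function withBrowser(launch, fn) {
  const browser = await launch();
  try {
    return await fn(browser);
  } finally {
    await browser.close(); // runs on success and on failure
  }
}

// Demo with a fake browser standing in for Puppeteer.
const fakeLaunch = async () => ({
  close: async () => console.log("browser closed"),
});

withBrowser(fakeLaunch, async () => {
  throw new Error("selector not found");
}).catch((err) => console.log(`task failed: ${err.message}`));
```

With real Puppeteer, the body of task.js would move into the callback passed to withBrowser.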
    

    Set up

    1. Install the Heroku CLI and create a new app with Node and Puppeteer buildpacks (see this answer):

    heroku create
    heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack -a cryptic-dawn-48835
    heroku buildpacks:add --index 1 heroku/nodejs -a cryptic-dawn-48835
    

    (replace cryptic-dawn-48835 with your app name)

    2. Deploy:

    git init
    git add .
    git commit -m "initial commit"
    heroku git:remote -a cryptic-dawn-48835
    git push heroku master
    

  3. Add a scheduler:

    heroku addons:create scheduler:standard -a cryptic-dawn-48835
    

    Configure the scheduler by running:

    heroku addons:open scheduler -a cryptic-dawn-48835
    

    This opens a browser page where you can add the command node task.js and set it to run every 10 minutes.

    Verify that it worked after 10 minutes with heroku logs --tail. The online scheduler will show the time of next/previous execution.


    See this answer for creating an Express-based web app on Heroku with Puppeteer.
