html - How do I recursively crawl a URL's subdirectories?

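Hi everyone, I'm trying to write a web crawler that takes a website's main URL and crawls that site's subdirectories. I've been stuck on this for a long time. Can anyone help? Thanks very much in advance!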

Here is a sample of the output I'm trying to get:

Title of https://www.dintaifung.com.sg/index.php: Din Tai Fung Singapore

Title of https://www.dintaifung.com.sg/about.php: Din Tai Fung - About Us

Title of https://www.dintaifung.com.sg/ ...: Din Tai Fung - ...

and so on ...

var Crawler = require("crawler");

var c = new Crawler({
    maxConnections : 10,
    // This will be called for each crawled page
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
            // $ is Cheerio by default
            //a lean implementation of core jQuery designed specifically for the server
            console.log($("title").text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('https://www.dintaifung.com.sg/');

Best answer

You should get a callback for each page crawled. The res parameter passed to it describes that page.

Try something like this for your callback function.

callback : function (error, res, done) {
    if(error) {
        console.log(error);
    } else {
        const $ = res.$;
        const title = $("title").text();            //find the page's title
        const url = res.request.uri.toString();     //get the fetched URL
        const display = `Title of ${url}: ${title}`; //make your display string
        console.log(display);                       //display it
    }
    done();
}

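$ lets you use jQuery-like operations to search the body of the page the crawler retrieved ("traversing the Document Object Model"). If you want to look at it, the raw body is available in res.body. Keep in mind, though, that it looks just like what "View Source..." shows for the page you crawled.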
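The callback above only prints titles. To reach the subdirectories, each page's links also have to be collected and queued. Below is a minimal sketch of the filtering step. It assumes you want to stay on the same origin and visit each URL once. `sameSiteLinks` and the `visited` set are hypothetical names for illustration, not part of the crawler package, and only Node's built-in `URL` class is used.

```javascript
// Hypothetical helper (not part of the crawler package): resolve each href
// against the page URL and keep only unseen links on the same origin.
function sameSiteLinks(pageUrl, hrefs, visited) {
    const origin = new URL(pageUrl).origin;
    const next = [];
    for (const href of hrefs) {
        let target;
        try {
            target = new URL(href, pageUrl); // resolves relative links
        } catch (e) {
            continue;                        // skip malformed hrefs
        }
        target.hash = "";                    // ignore #fragments
        if (target.origin !== origin) continue;   // other sites, mailto:, etc.
        if (visited.has(target.href)) continue;   // already seen
        visited.add(target.href);
        next.push(target.href);
    }
    return next;
}
```

Inside the callback, the hrefs could be gathered with something like `$("a").map((i, el) => $(el).attr("href")).get()`, and each URL returned by the helper passed to `c.queue(...)`. Treat that wiring as an assumption to verify against the crawler version you have installed.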

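Tip: may I suggest using maxConnections: 2 rather than 10 while debugging? Crawling puts a heavy load on a website, and crawling with a new, untested app makes that worse.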