Increasing the number of threads in a crawler


Question

This code is taken from http://code.google.com/p/crawler4j/; the file is named MyCrawler.java.


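import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {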

        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        /*
         * You should implement this function to specify
         * whether the given URL should be visited or not.
         */
        public boolean shouldVisit(WebURL url) {
                String href = url.getURL().toLowerCase();
                if (filters.matcher(href).matches()) {
                        return false;
                }
                if (href.startsWith("http://www.xyz.us.edu/")) {
                        return true;
                }
                return false;
        }

        /*
         * This function is called when a page is fetched
         * and ready to be processed by your program
         */
        public void visit(Page page) {
                int docid = page.getWebURL().getDocid();
                String url = page.getWebURL().getURL();
                String text = page.getText();
                List<WebURL> links = page.getURLs();
        }
}

And this is the code for Controller.java, from which MyCrawler is launched:

public class Controller {
        public static void main(String[] args) throws Exception {
                CrawlController controller = new CrawlController("/data/crawl/root");
                controller.addSeed("http://www.xyz.us.edu/");
                controller.start(MyCrawler.class, 10);
        }
}

I just want to make sure I understand what this line in Controller.java means:

controller.start(MyCrawler.class, 10);

What does the 10 mean here? And if I increase it from 10 to 20, what effect will that have? Any suggestions will be appreciated.

Recommended answer

This website shows the source for CrawlController.

Increasing the value from 10 to 20 increases the number of crawlers (each running in its own thread). Studying that code will tell you what effect this will have.
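To make "N crawlers, each in its own thread" a bit more concrete, here is a minimal, self-contained sketch of the general idea. It does not use crawler4j and is not how CrawlController is actually implemented; the class name ThreadedCrawlSketch, its start method, and the shared in-memory queue are all assumptions made up for illustration. It only shows that raising the second argument starts more worker threads pulling from the same pool of URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a crawl controller. NOT crawler4j's implementation.
class ThreadedCrawlSketch {

    /**
     * Starts numberOfCrawlers threads that drain one shared URL queue.
     * Returns how many URLs each thread handled (index = crawler number).
     */
    static int[] start(List<String> urls, int numberOfCrawlers)
            throws InterruptedException {
        Queue<String> frontier = new ConcurrentLinkedQueue<>(urls);
        int[] handled = new int[numberOfCrawlers];
        List<Thread> threads = new ArrayList<>();

        for (int i = 0; i < numberOfCrawlers; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                // A real crawler would fetch and parse each page here.
                // Each thread writes only to its own slot of the array.
                while (frontier.poll() != null) {
                    handled[id]++;
                }
            }, "Crawler " + (id + 1));
            threads.add(t);
            t.start();
        }

        // Wait for every crawler thread to finish before reading the counts.
        for (Thread t : threads) {
            t.join();
        }
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            urls.add("http://www.xyz.us.edu/page" + i);
        }
        // Same shape as controller.start(MyCrawler.class, 10) in the question.
        int[] handled = start(urls, 10);
        for (int i = 0; i < handled.length; i++) {
            System.out.println("Crawler " + (i + 1) + " handled " + handled[i]);
        }
    }
}
```

In this sketch the total amount of work stays the same when the count goes from 10 to 20; it is simply split across more threads. Whether that actually speeds up a real crawl depends on things the sketch leaves out, such as network latency and any politeness delay between requests to the same host.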


08-04 21:45