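I have a simple web crawler built from the building blocks of crawler4j. I am trying to build a dictionary while the crawler crawls, and then pass it to my main class (Controller) once the text has been collected and parsed. How can I do this, given that the MyCrawler object is not created in my main class (MyCrawler.class is passed as the first argument)? I also cannot change the controller.start method. I want to be able to use the dictionary built inside the crawler after the crawl has finished.
The best approach I can think of is to have controller.start accept a MyCrawler object that I have already created, but I do not see any way to do that.
My code is below. Any help is much appreciated!
Crawler: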
import java.util.ArrayList;
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler
{
    // Skip static resources that contain no text worth indexing
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|zip|gz))$");
    public ArrayList<String> dictionary = new ArrayList<String>();

    @Override public boolean shouldVisit(Page referringPage, WebURL url)
    {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
            && href.startsWith("http://lyle.smu.edu/~fmoore");
    }

    @Override public void visit(Page page)
    {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
        if(page.getParseData() instanceof HtmlParseData)
        {
            HtmlParseData h = (HtmlParseData)page.getParseData();
            String text = h.getText();
            String[] words = text.split(" ");
            for(int i = 0;i < words.length;i++)
            {
                // split() never returns null elements, so only empty and newline tokens are skipped
                if(!words[i].isEmpty() && !words[i].equals("\n"))
                    dictionary.add(words[i]);
            }
            String html = h.getHtml();
            Set<WebURL> links = h.getOutgoingUrls();
            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
            System.out.println(text);
        }
    }
}
Controller:
import java.util.ArrayList;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller
{
    public ArrayList<String> dictionary = new ArrayList<String>();

    public static void main(String[] args) throws Exception
    {
        int numberOfCrawlers = 1;
        String crawlStorageFolder = "/data/crawl/root";
        CrawlConfig c = new CrawlConfig();
        c.setCrawlStorageFolder(crawlStorageFolder);
        c.setMaxDepthOfCrawling(-1); //Unlimited Depth
        c.setMaxPagesToFetch(-1); //Unlimited Pages
        c.setPolitenessDelay(200); //Politeness Delay
        PageFetcher pf = new PageFetcher(c);
        RobotstxtConfig robots = new RobotstxtConfig();
        RobotstxtServer rs = new RobotstxtServer(robots, pf);
        CrawlController controller = new CrawlController(c, pf, rs);
        controller.addSeed("http://lyle.smu.edu/~fmoore");
        controller.start(MyCrawler.class, numberOfCrawlers);
        controller.shutdown();
        controller.waitUntilFinish();
    }
}
Best answer
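Have a WebCrawlerFactory create your MyCrawler objects. This should solve the problem (at least as of version 4.2). However, your dictionary should support concurrent access (a plain ArrayList does not!).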
// use a factory, instead of supplying the crawler type to pass the dictionary
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(dictionary);
    }
}, numberOfCrawlers);
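To illustrate the concurrency point on its own, here is a minimal sketch that uses only the Java standard library, not crawler4j. The `Worker` class is a hypothetical stand-in for MyCrawler, and a `Supplier` plays the role of the factory; both names are assumptions made for this example. It shows one possible way to share a thread-safe list between several instances and read it once they have all finished.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

class SharedDictionarySketch {

    /** Hypothetical stand-in for MyCrawler: it writes into the dictionary it was given. */
    static class Worker implements Runnable {
        private final List<String> dictionary;
        private final String text;

        Worker(List<String> dictionary, String text) {
            this.dictionary = dictionary;
            this.text = text;
        }

        @Override
        public void run() {
            // Split on any run of whitespace and skip empty tokens.
            for (String word : text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    dictionary.add(word);
                }
            }
        }
    }

    /** Runs `workers` threads over the same text and returns the shared dictionary. */
    static List<String> collect(String text, int workers) throws InterruptedException {
        // synchronizedList makes concurrent add() calls safe; a plain ArrayList is not.
        List<String> dictionary = Collections.synchronizedList(new ArrayList<>());

        // The factory closes over the dictionary, so every instance it creates shares it.
        Supplier<Worker> factory = () -> new Worker(dictionary, text);

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(factory.get());
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join(); // wait for every worker before reading the results
        }
        return dictionary;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> words = collect("the quick  brown fox\n", 4);
        System.out.println(words.size() + " words collected");
    }
}
```

Applied to the crawler, this would mean giving MyCrawler a constructor that takes the list and stores it in a field, and creating the list somewhere that a static main method can reach, for example as a local variable rather than an instance field of Controller.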