web-crawler - Getting outlinks from Nutch

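I am using Nutch 1.3 to crawl a website. I want to get the list of URLs that were crawled, as well as the URLs that originate from each page (its outlinks).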

I can get the list of crawled URLs with the readdb command:

bin/nutch readdb crawl/crawldb -dump file

Is there a way to find out the URLs on a page by reading the crawldb or the linkdb?

In org.apache.nutch.parse.html.HtmlParser I see an outlinks array, and I am wondering whether there is a quick way to access it from the command line.

Best answer

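From the command line, you can see the outlinks by using readseg with the -dump or -get option. The outlinks are stored in the segment's parse data, so the example below switches off every other part of the segment and leaves only the parse data in the dump. For example,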

bin/nutch readseg -dump crawl/segments/20110919084424/ outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext

less outputdir2/dump
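
If you want the links as a flat list rather than paging through the dump, something like the following might help. It is only a minimal sketch: extract_outlinks is a hypothetical helper name, and it assumes each record in the dump has a line starting with "URL:: " followed by lines of the form "outlink: toUrl: <target> anchor: <text>". Check your own dump file first, since the layout may differ between Nutch versions.

```shell
# Hypothetical helper: print "source<TAB>target" pairs from a readseg dump file.
# Assumed dump layout (verify against your own output):
#   URL:: http://example.com/page
#   ...
#     outlink: toUrl: http://example.com/other anchor: some text
extract_outlinks() {
  awk '
    # Remember the page URL of the record currently being read.
    /^URL:: / { src = $2 }
    # For each outlink line, print the field that follows "toUrl:".
    /^[[:space:]]*outlink: toUrl: / {
      for (i = 1; i <= NF; i++) {
        if ($i == "toUrl:") { print src "\t" $(i + 1); break }
      }
    }
  ' "$1"
}
```

Usage would then be: extract_outlinks outputdir2/dump > outlinks.tsv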

Source: Stack Overflow, https://stackoverflow.com/questions/7425136/