This article describes how to handle the problem "Nutch - not crawling, says 'Stopping at depth=1 - no more URLs to fetch'". It should be a useful reference for anyone hitting the same issue.

Problem Description

I've been trying for a long time to crawl using Nutch, but it just doesn't seem to run. I'm trying to build a Solr search for a website, using Nutch for crawling and indexing into Solr.

There were some permission problems originally, but they have been fixed now. The URL I'm trying to crawl is http://172.30.162.202:10200/, which is not publicly accessible. It is an internal URL that can be reached from the Solr server; I tried browsing it using Lynx.
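
For reference, the reachability check from the box that runs the crawl can be made explicit. The commands below are only an illustrative sketch and assume curl and lynx are installed on that server:

curl -I http://172.30.162.202:10200/             # should return an HTTP status line and headers
lynx -dump http://172.30.162.202:10200/ | head   # should print the beginning of the page text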

Given below is the output from the Nutch command:

[abgu01@app01 local]$ ./bin/nutch crawl /home/abgu01/urls/url1.txt -dir /home/abgu01/crawl -depth 5 -topN 100
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /opt/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log (No such file or directory)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:136)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:125)
        at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254)
        at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
solrUrl is not set, indexing will be skipped...
crawl started in: /home/abgu01/crawl
rootUrlDir = /home/abgu01/urls/url1.txt
threads = 10
depth = 5
solrUrl=null
topN = 100
Injector: starting at 2012-07-27 15:47:00
Injector: crawlDb: /home/abgu01/crawl/crawldb
Injector: urlDir: /home/abgu01/urls/url1.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-07-27 15:47:03, elapsed: 00:00:02
Generator: starting at 2012-07-27 15:47:03
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/abgu01/crawl/segments/20120727154705
Generator: finished at 2012-07-27 15:47:06, elapsed: 00:00:03
Fetcher: starting at 2012-07-27 15:47:06
Fetcher: segment: /home/abgu01/crawl/segments/20120727154705
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://172.30.162.202:10200/
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-07-27 15:47:08, elapsed: 00:00:02
ParseSegment: starting at 2012-07-27 15:47:08
ParseSegment: segment: /home/abgu01/crawl/segments/20120727154705
ParseSegment: finished at 2012-07-27 15:47:09, elapsed: 00:00:01
CrawlDb update: starting at 2012-07-27 15:47:09
CrawlDb update: db: /home/abgu01/crawl/crawldb
CrawlDb update: segments: [/home/abgu01/crawl/segments/20120727154705]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-07-27 15:47:10, elapsed: 00:00:01
Generator: starting at 2012-07-27 15:47:10
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-07-27 15:47:11
LinkDb: linkdb: /home/abgu01/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/abgu01/crawl/segments/20120727154705
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Can anyone please suggest what the reason could be for the crawl not running? It always ends with "Stopping at depth=1 - no more URLs to fetch", irrespective of the values of the depth and topN parameters. Looking at the output above, I think the reason is that the Fetcher isn't able to fetch any content from the URL.
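
One way to narrow this down is to let Nutch fetch and parse the single seed URL outside the crawl cycle. This is only a sketch, assuming the parsechecker tool that ships with Nutch 1.x is available in your build:

./bin/nutch parsechecker http://172.30.162.202:10200/    # fetches and parses one URL, reporting errors and outlinks

If this reports an error or finds no outlinks, the depth and topN values are irrelevant: with nothing fetched from the seed, the generator has no new URLs for the second round, which is exactly what the "Stopping at depth=1" message reflects.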

Any input is appreciated!

Recommended Answer

A site can block crawling via robots.txt or a meta(name="robots" content="noindex") tag. Please check.
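
A quick way to check both, assuming curl is available on the machine that runs the crawl, is a sketch along these lines:

curl http://172.30.162.202:10200/robots.txt                         # look for Disallow rules that cover /
curl -s http://172.30.162.202:10200/ | grep -i '<meta[^>]*robots'   # look for a noindex/nofollow robots meta tag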

PS: your log also isn't clean; it shows two other problems:

1. java.io.FileNotFoundException: /opt/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log (No such file or directory)
2. solrUrl is not set, indexing will be skipped...
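
Both points are easy to address. A sketch, assuming the default runtime/local layout shown in the log and a Solr core at http://localhost:8983/solr/ (replace this with your real Solr URL):

mkdir -p /opt/apache-nutch-1.4-bin/runtime/local/logs   # lets log4j create hadoop.log
./bin/nutch crawl /home/abgu01/urls/url1.txt -solr http://localhost:8983/solr/ -dir /home/abgu01/crawl -depth 5 -topN 100

With hadoop.log being written, the real fetch error (for example a robots.txt block or an HTTP failure) should show up there.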

That concludes this article on "Nutch - not crawling, says 'Stopping at depth=1 - no more URLs to fetch'". Hopefully the recommended answer is helpful.
