Apache Nutch: FetcherJob throws NoSuchElementException deep in Gora

Problem description

I'm running Apache Nutch 2.3.1 out of the box, which uses Gora 0.6.1. I've followed the instructions here: http://wiki.apache.org/nutch/RunNutchInEclipse

It ran fine with the InjectorJob.

Now I'm running the FetcherJob, and Gora uses MemStore as a data store. I have gora.properties containing

gora.datastore.default=org.apache.gora.memory.store.MemStore

This throws:

2016-10-02 22:55:54,605 ERROR mapreduce.GoraRecordReader (GoraRecordReader.java:nextKeyValue(121)) - Error reading Gora records: null
2016-10-02 22:55:54,605 INFO  mapred.MapTask (MapTask.java:flush(1460)) - Starting flush of map output
2016-10-02 22:55:54,614 INFO  mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2016-10-02 22:55:54,615 WARN  mapred.LocalJobRunner (LocalJobRunner.java:run(560)) - job_local874667143_0001
java.lang.Exception: java.lang.RuntimeException: java.util.NoSuchElementException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:122)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException
    at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
    at org.apache.gora.memory.store.MemStore.execute(MemStore.java:128)
    at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73)
    at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:67)
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:109)
    ... 12 more
2016-10-02 22:55:55,383 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1360)) - Job job_local874667143_0001 running in uber mode : false
2016-10-02 22:55:55,385 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1367)) -  map 0% reduce 0%
2016-10-02 22:55:55,387 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Job job_local874667143_0001 failed with state FAILED due to: NA
2016-10-02 22:55:55,396 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1385)) - Counters: 0
Exception in thread "main" java.lang.RuntimeException: job failed: name=, jobid=job_local874667143_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:205)
    at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:251)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:314)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:321)

This happens so deep inside Nutch and Gora that I have no idea what causes it. I tried upgrading to Gora 0.8, but the problem remained. I tried downgrading Gora to 0.6, with the same result. I considered switching to another data store such as HBase, but that is overkill for what I need at the moment.

Please help me figure this out.

Solution

I confirm the problem is in MemStore.

In 0.6.1 there is a bug: MemStore.execute() calls firstKey() on the underlying map without checking whether the map is empty, which is exactly where the stack trace above ends: https://github.com/apache/gora/blob/apache-gora-0.6.1/gora-core/src/main/java/org/apache/gora/memory/store/MemStore.java#L128

That is already fixed in master: https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/memory/store/MemStore.java#L155 , where the call to #firstKey() is guarded by an #isEmpty() check.
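
To make the failure mode concrete, here is a minimal sketch that reproduces it outside Gora. It is not Gora's code: the class FirstKeyGuardDemo and both helper methods are hypothetical names made up for illustration, and only ConcurrentSkipListMap and its firstKey()/isEmpty() methods come from the stack trace and the links above.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical demo class; it only illustrates the pattern, it is not MemStore.
public class FirstKeyGuardDemo {

    // Unguarded lookup, the same call that appears at the bottom of the stack trace.
    // Throws NoSuchElementException when the map is empty.
    static String unguardedFirstKey(ConcurrentSkipListMap<String, String> map) {
        return map.firstKey();
    }

    // Guarded lookup: checks isEmpty() first and returns null for an empty map.
    static String guardedFirstKey(ConcurrentSkipListMap<String, String> map) {
        return map.isEmpty() ? null : map.firstKey();
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();

        try {
            unguardedFirstKey(map);
        } catch (NoSuchElementException e) {
            System.out.println("unguarded firstKey() on an empty map threw: " + e);
        }

        System.out.println("guarded lookup on an empty map: " + guardedFirstKey(map));

        // The key below is an arbitrary illustrative value.
        map.put("com.example:http/", "row");
        System.out.println("guarded lookup after one put: " + guardedFirstKey(map));
    }
}
```

An empty map is presumably what FetcherJob sees here, since MemStore holds its records only in the memory of the running process. In code where other threads may remove entries between the isEmpty() check and the firstKey() call, firstEntry() is a single-call alternative, because it returns null for an empty map instead of throwing.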

Edit

If you want to use Gora 0.7-SNAPSHOT with Nutch 2.x, the following may work:

  1. Download Gora's master branch, which is at version 0.7-SNAPSHOT
  2. Run mvn install in gora/ to install it into Maven's local repository
  3. Apply this patch to Nutch: https://paste.apache.org/jjqz so that Nutch 2.3.1 works with Gora 0.7-SNAPSHOT
  4. Follow the steps in the Nutch tutorial

I hope it works :)

Edit 2

As for HBase, a local installation for experimenting is quite easy to set up.

  1. As stated in Nutch2Tutorial, download HBase 0.98.8-hadoop2
  2. Extract the tar.gz file into a directory, for example: /home/you/hbase
  3. cd /home/you/hbase/bin
  4. ./start-hbase.sh

Now you have HBase up and running. Configure Nutch:

ivy/ivy.xml: Look at @Emmanuel's comment about HBase's ivy dependency configuration.

gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
gora.datastore.autocreateschema=true
gora.datastore.scanner.caching=100
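
If you are unsure which data store a given properties file selects, a quick check like the one below can help. It is only a sketch: GoraPropsCheck and defaultDataStore are hypothetical names, and it assumes the file follows the standard java.util.Properties syntax. It does not use any Gora API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper class, not part of Nutch or Gora.
public class GoraPropsCheck {

    // Returns the value of gora.datastore.default, or null if the key is absent.
    static String defaultDataStore(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return props.getProperty("gora.datastore.default");
    }

    public static void main(String[] args) throws IOException {
        // Same three lines as the gora.properties shown above.
        String text = String.join("\n",
                "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore",
                "gora.datastore.autocreateschema=true",
                "gora.datastore.scanner.caching=100");
        System.out.println("selected data store: " + defaultDataStore(new StringReader(text)));
    }
}
```

To check a real file, pass a java.io.FileReader for your gora.properties instead of the StringReader.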

nutch-site.xml:

<configuration>
<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>
</configuration>

Done. Nutch will use HBase's default configuration: localhost, /tmp/..., and so on.
