Nutch 1.13 index-links configuration

Problem description

I am currently trying to extract the webgraph structure during a crawl with Apache Nutch 1.13 and Solr 4.10.4.

According to the documentation, the index-links plugin adds outlinks and inlinks to the collection.
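
For reference, the plugin is normally switched on through the plugin.includes property in conf/nutch-site.xml. The fragment below is a minimal sketch: the surrounding plugin list is an assumption modelled on a typical Nutch 1.x setup and should be adapted to whatever your installation already uses. The only relevant change is adding links to the index-(...) group.

```xml
<!-- conf/nutch-site.xml (sketch; the other plugin names are assumed, keep your own list) -->
<property>
  <name>plugin.includes</name>
  <!-- "links" inside index-(...) activates the index-links plugin -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|links)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```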

I have changed my collection in Solr accordingly (added the respective fields to schema.xml and restarted Solr) and adapted the solr-mapping file, but to no avail. The resulting error can be seen below.

bin/nutch index -D solr.server.url=http://localhost:8983/solr/collection1 crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter -normalize -deleteGone
Segment dir is complete: crawl/segments/20170503114357.
Indexer: starting at 2017-05-03 11:47:02
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexing 1/1 documents
Deleting 0 documents
Indexing 1/1 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Interestingly, my own research led me to assume that this is in fact non-trivial, since the resulting parse (without the plugin) looks like this:

bin/nutch indexchecker http://www.my-domain.com/
fetching: http://www.my-domain.com/
robots.txt whitelist not configured.
parsing: http://www.my-domain.com/
contentType: application/xhtml+xml
tstamp :    Wed May 03 11:40:57 CEST 2017
digest :    e549a51553a0fb3385926c76c52e0d79
host :  http://www.my-domain.com/
id :    http://www.my-domain.com/
title : Startseite
url :   http://www.my-domain.com/
content :   bla bla bla bla.

Yet once I enable index-links, the output suddenly looks like this:

bin/nutch indexchecker http://www.my-domain.com/
fetching: http://www.my-domain.com/
robots.txt whitelist not configured.
parsing: http://www.my-domain.com/
contentType: application/xhtml+xml
tstamp :    Wed May 03 11:40:57 CEST 2017
outlinks :  http://www.my-domain.com/2-uncategorised/331-links-administratives
outlinks :  http://www.my-domain.com/2-uncategorised/332-links-extern
outlinks :  http://www.my-domain.com/impressum.html
id :    http://www.my-domain.com/
title : Startseite
url :   http://www.my-domain.com/
content :   bla bla bla

Obviously, this cannot fit into a single-valued field, but I just want a single list with all the outlinks (I have read that the inlinks do not work, but I do not need them anyway).

Recommended answer

You have to specify the fields in solrindex-mapping.xml like this:

<field dest="inlinks" source="inlinks"/>
<field dest="outlinks" source="outlinks"/>
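
For context, here is a trimmed sketch of where these two entries would sit in conf/solrindex-mapping.xml. It assumes the stock layout shipped with Nutch 1.x (a mapping root holding a fields list and a uniqueKey). The other field entries are illustrative only, and a real file usually lists more of them.

```xml
<!-- conf/solrindex-mapping.xml (trimmed sketch; entries other than inlinks/outlinks are illustrative) -->
<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <field dest="url" source="url"/>
    <!-- added for the index-links plugin -->
    <field dest="inlinks" source="inlinks"/>
    <field dest="outlinks" source="outlinks"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

Each dest name must also exist as a field in the Solr schema, otherwise Solr rejects the document and the indexing job fails.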

Afterwards, make sure to unload and reload the collection, including a complete restart of Solr.

You did not specify exactly how you implemented the fields in schema.xml, but the following worked for me:

<!-- fields for index-links plugin -->
<field name="inlinks" type="url" stored="true" indexed="false" multiValued="true"/>
<field name="outlinks" type="url" stored="true" indexed="false" multiValued="true"/>
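
Note that type="url" only resolves if the schema declares a field type with that name. The sketch below shows what such a declaration might look like. The analyzer chain is an assumption rather than a copy of any particular release, so compare it with the fieldType entries already present in your own schema.xml. Because both fields are stored but not indexed, the analyzer has little practical effect here, and a plain string type would presumably work as well.

```xml
<!-- schema.xml (assumed definition; check the fieldType entries in your own schema) -->
<fieldType name="url" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The multiValued="true" attribute on the two fields is what lets each document carry one value per link, matching the repeated outlinks lines in the indexchecker output.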

Best regards and good luck!

