Problem description
I want to develop a topical web robot using Nutch 2.2.1, and I want to create a new property in nutch-site.xml that holds some topic keywords, like the following:
<property>
<name>html.metatitle.keys</name>
<value>movie,actor,firm</value>
<description>
</description>
</property>
Recommended answer
There are two different solutions to your problem:
The first is to implement a custom HtmlParseFilter plugin that filters pages based on your desired keywords. For more information about Nutch extension points and writing custom plugins for Nutch, take a look at these manuals:
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample
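As a rough illustration of the keyword check such a plugin could perform, here is a minimal sketch in plain Java. It deliberately does not use the Nutch API: the class name TopicKeywordFilter and its methods are hypothetical, and it assumes the plugin has already read the comma-separated value of html.metatitle.keys (for example through the Hadoop Configuration object the plugin receives, via conf.get("html.metatitle.keys")) and has the page title or text available as a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration only; it is not part of Nutch. */
final class TopicKeywordFilter {
    private final List<String> keywords;

    private TopicKeywordFilter(List<String> keywords) {
        this.keywords = keywords;
    }

    /** Parses a comma-separated property value such as "movie,actor,firm". */
    static TopicKeywordFilter fromProperty(String value) {
        List<String> parsed = new ArrayList<>();
        if (value != null) {
            for (String part : value.split(",")) {
                String keyword = part.trim().toLowerCase(Locale.ROOT);
                if (!keyword.isEmpty()) {
                    parsed.add(keyword);
                }
            }
        }
        return new TopicKeywordFilter(parsed);
    }

    /** Returns true if the text contains at least one keyword (case-insensitive substring match). */
    boolean matches(String text) {
        if (text == null) {
            return false;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (haystack.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: in a real plugin this string would come from the
        // configuration, e.g. conf.get("html.metatitle.keys").
        TopicKeywordFilter filter = TopicKeywordFilter.fromProperty("movie,actor,firm");
        System.out.println(filter.matches("Top 10 Movie Trailers"));
        System.out.println(filter.matches("Weather forecast"));
    }
}
```

Note that this sketch uses simple substring matching, so the keyword "firm" would also match a word like "confirm". If that matters for your topic, you would probably want to tokenize the text and compare whole words instead.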
The second is to use an indexer to filter documents based on the desired keywords. This solution is only available if your system architecture includes an indexer. In that case, Apache Solr can help you filter documents before indexing; you would have to implement a custom UpdateRequestProcessor. For more information about Solr and its extension points, take a look at these pages:
https://wiki.apache.org/solr/FrontPage
https://wiki.apache.org/solr/UpdateRequestProcessor
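To make the indexing-side idea more concrete, here is a small sketch of the accept-or-drop decision in plain Java. It is only a stand-in and does not use the Solr API: the class PreIndexKeywordFilter, the representation of a document as a Map of field names to values, and the field name "title" are all assumptions made for illustration. In an actual UpdateRequestProcessor, the same decision would typically live in processAdd, where a document is dropped by not passing the command on to the next processor in the chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for an indexing-side filter; it does not use the Solr API. */
final class PreIndexKeywordFilter {
    private final List<String> keywords = new ArrayList<>();
    private final String fieldName;

    PreIndexKeywordFilter(List<String> keywords, String fieldName) {
        for (String keyword : keywords) {
            this.keywords.add(keyword.toLowerCase(Locale.ROOT));
        }
        this.fieldName = fieldName;
    }

    /** Accepts a document only if the chosen field contains at least one keyword. */
    boolean accept(Map<String, String> doc) {
        String value = doc.get(fieldName);
        if (value == null) {
            return false; // documents without the field are dropped
        }
        String haystack = value.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (haystack.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the documents that would be passed on to the indexer, in order. */
    List<Map<String, String>> filter(List<Map<String, String>> docs) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (accept(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        PreIndexKeywordFilter filter =
                new PreIndexKeywordFilter(Arrays.asList("movie", "actor", "firm"), "title");

        Map<String, String> onTopic = new LinkedHashMap<>();
        onTopic.put("title", "Famous Actor Interview");
        Map<String, String> offTopic = new LinkedHashMap<>();
        offTopic.put("title", "Stock market news");

        // Prints only the documents whose title mentions a topic keyword.
        System.out.println(filter.filter(Arrays.asList(onTopic, offTopic)));
    }
}
```

Compared with filtering in a Nutch plugin, this approach still fetches and parses off-topic pages and only keeps them out of the index, so it may be the less efficient option if crawl volume is a concern.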