Nutch regex-urlfilter syntax

Question

I am running Nutch 1.6 and it crawls specific sites correctly, but I can't seem to get the syntax right for the file NUTCH_ROOT/conf/regex-urlfilter.txt.

The site I want to crawl has a URL similar to this:

http://www.example.com/foo.cfm

That page contains numerous links that match the following pattern:

http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976

I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following:

+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$

Nutch matches the first rule and crawls that page correctly, but does not seem to pick up the links covered by the second rule. How can I get Nutch to crawl URLs like the second one above?

I have tried the following with no luck:

+^http://www.example.com/foo.cfm/(.+)*$
+^http://www.example.com/foo.cfm/(.)*$
+^http://www.example.com/foo.cfm/.+$
+^http://www.example.com/foo.cfm/(.*)*$

In my NUTCH_ROOT/urls/nutch I have:

http://www.example.com/foo.cfm/

Recommended answer

According to http://wiki.apache.org/nutch/FAQ#What_happens_if_I_inject_urls_several_times.3F you can't have multiple URLs (they will be ignored). How about using only:

+^http://www.example.com/foo.cfm/(.+)*$

This is meant to cover your first line, +^http://www.example.com/foo.cfm$, as well, although as written it requires a / after foo.cfm and so will not match that URL without the trailing slash. If the / itself is causing problems, try:

+^http://www.example.com/foo.cfm//?(.+)*$

Here //? is intended to stand for the / character; as a regular expression it matches one / optionally followed by a second /.
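One way to check these patterns outside Nutch is a small Python sketch. It uses Python's re module as a stand-in for the Java regex engine that Nutch uses (the two agree on patterns this simple). It assumes, as the comments in the stock regex-urlfilter.txt describe, that rules are tried top to bottom, that the first rule whose pattern is found in the URL decides, and that a URL matching no rule is dropped. The -[?*!@=] line is also an assumption: it ships in the default file, but the question does not say whether it still sits above the custom rules. The helper name first_match_decision is made up for illustration.

```python
import re

TARGET = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976"
SEED = "http://www.example.com/foo.cfm/"

# Some of the patterns from the question and the answer, without the leading "+".
candidates = [
    r"^http://www.example.com/foo.cfm/(.+)*$",
    r"^http://www.example.com/foo.cfm/(.)*$",
    r"^http://www.example.com/foo.cfm/.+$",
    r"^http://www.example.com/foo.cfm//?(.+)*$",
]

# Does every candidate match the deep link when tested on its own?
all_match = all(re.search(p, TARGET) for p in candidates)


def first_match_decision(rules, url):
    """Hypothetical model of the filter: the first matching rule decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is ignored


# Assumed layout: the stock "-[?*!@=]" rule placed before the custom rules.
stock_like = [
    ("-", r"[?*!@=]"),
    ("+", r"^http://www.example.com/foo.cfm$"),
    ("+", r"^http://www.example.com/foo.cfm/(.+)*$"),
]

# Same rules with the custom "+" lines moved above the "-" rule.
reordered = stock_like[1:] + stock_like[:1]

print("all candidates match target:", all_match)
print("stock-like order, target:", first_match_decision(stock_like, TARGET))
print("stock-like order, seed:  ", first_match_decision(stock_like, SEED))
print("reordered, target:       ", first_match_decision(reordered, TARGET))
```

If this models your file correctly, the patterns themselves are fine and the = in ID=6976 is being rejected by an earlier rule. In that case, moving the + lines above -[?*!@=], or dropping = from that character class, would be the thing to try.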
