Problem description
I am running Nutch v. 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax right for the file NUTCH_ROOT/conf/regex-urlfilter.txt.
The site I want to crawl has a URL similar to this:
http://www.example.com/foo.cfm
On that page there are numerous links that match the following pattern:
http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976
I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following:
+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$
Nutch matches the first one and crawls it correctly, but does not seem to pick up links using the other filter. How can I get Nutch to crawl URLs like the second one above?
I have tried the following with no luck:
+^http://www.example.com/foo.cfm/(.+)*$
+^http://www.example.com/foo.cfm/(.)*$
+^http://www.example.com/foo.cfm/.+$
+^http://www.example.com/foo.cfm/(.*)*$
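To see whether the syntax of these patterns is the problem, they can be tried against the sample link outside of Nutch. The sketch below is not Nutch code: it assumes the filter is backed by java.util.regex and looks for the pattern anywhere in the URL (find-style matching), and the class name PatternCheck is made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

class PatternCheck {
    // Assumption: a rule applies when its regex is found somewhere in the URL.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        // The sample link from the question.
        String url = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976";

        // The four patterns from the question, without the leading '+'.
        List<String> patterns = List.of(
            "^http://www.example.com/foo.cfm/(.+)*$",
            "^http://www.example.com/foo.cfm/(.)*$",
            "^http://www.example.com/foo.cfm/.+$",
            "^http://www.example.com/foo.cfm/(.*)*$");

        for (String p : patterns) {
            System.out.println(found(p, url) + "  " + p);
        }
    }
}
```

Under that assumption all four patterns match the sample link, which suggests the cause may lie elsewhere than in the pattern syntax itself.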
In my NUTCH_ROOT/urls/nutch I have:
http://www.example.com/foo.cfm/
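One thing the thread does not address is rule order inside regex-urlfilter.txt. The sketch below assumes two things that are not confirmed for this setup: that the first rule whose pattern matches a URL decides its fate, and that the file still contains a rule such as -[?*!@=] (as found in the stock configuration, as far as I recall) above the custom + lines. The class FirstMatchFilter is a hypothetical stand-in, not Nutch's implementation.

```java
import java.util.List;
import java.util.regex.Pattern;

class FirstMatchFilter {
    // Assumed semantics: rules are read top to bottom; the first rule whose
    // regex is found in the URL decides ('+' accepts, '-' rejects).
    // A URL that matches no rule is rejected.
    static boolean accepts(List<String> ruleLines, String url) {
        for (String line : ruleLines) {
            boolean accept = line.charAt(0) == '+';
            String regex = line.substring(1);
            if (Pattern.compile(regex).matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976";
        String custom = "+^http://www.example.com/foo.cfm/.+$";
        String skipQueryChars = "-[?*!@=]"; // assumed to be present in the file

        // Reject rule first: the '=' in "ID=6976" matches it, so the URL is dropped.
        System.out.println(accepts(List.of(skipQueryChars, custom), url)); // false

        // Custom rule first: the URL is accepted before the reject rule is reached.
        System.out.println(accepts(List.of(custom, skipQueryChars), url)); // true
    }
}
```

If those assumptions hold for the actual file, moving the custom + lines above the reject rule (or relaxing that rule) would be worth trying.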
Recommended answer
According to http://wiki.apache.org/nutch/FAQ#What_happens_if_I_inject_urls_several_times.3F, you can't have multiple URLs (they will be ignored). How about using only this:
+^http://www.example.com/foo.cfm/(.+)*$
which should cover your first line, +^http://www.example.com/foo.cfm$, as well. Or, if there are problems with /, try:
+^http://www.example.com/foo.cfm//?(.+)*$
Where //? should stand for the character / or
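The two suggested patterns can be checked against the URLs from the question with plain java.util.regex. This is only a sketch: it assumes find-style matching, the class name AnswerPatternCheck is invented, and the third pattern (with an optional "/..." tail) is a hypothetical alternative of mine, not something proposed in the thread.

```java
import java.util.regex.Pattern;

class AnswerPatternCheck {
    // Assumption: a rule applies when its regex is found somewhere in the URL.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String single   = "^http://www.example.com/foo.cfm/(.+)*$";   // from the answer
        String doubled  = "^http://www.example.com/foo.cfm//?(.+)*$"; // from the answer
        String optional = "^http://www.example.com/foo.cfm(/.*)?$";   // hypothetical alternative

        String bare  = "http://www.example.com/foo.cfm";
        String slash = "http://www.example.com/foo.cfm/";
        String deep  = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976";

        for (String regex : new String[] {single, doubled, optional}) {
            System.out.println(regex);
            System.out.println("  bare:  " + found(regex, bare));
            System.out.println("  slash: " + found(regex, slash));
            System.out.println("  deep:  " + found(regex, deep));
        }
    }
}
```

In this check both patterns from the answer match the trailing-slash URL and the deeper link, but neither matches the bare foo.cfm URL, because each requires at least one / after foo.cfm. As written, //? matches one slash optionally followed by a second one. The hypothetical third pattern matches all three URLs.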