Problem Description
The INJECT step keeps retrieving only a single URL while I'm trying to crawl CNN. I'm using an almost-default config (nutch-site.xml below). What could that be? Shouldn't it be 10 docs, according to my generate.max.count value?
<configuration>
<property>
<name>http.agent.name</name>
<value>crawler1</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>solr.server.url</name>
<value>http://x.x.x.x:8983/solr/collection1</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
<property>
<name>generate.max.count</name>
<value>10</value>
</property>
</configuration>
Recommended Answer
A Nutch crawl consists of 4 basic steps: Generate, Fetch, Parse, and Update DB. These steps are the same for both Nutch 1.x and Nutch 2.x. Execution and completion of all four steps make up one crawl cycle.
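As a rough sketch, one cycle driven by hand with the Nutch 2.x command line (your HBaseStore setting suggests 2.x) looks like this; exact flags differ slightly between releases, so check bin/nutch usage for yours:

# one crawl cycle, Nutch 2.x style
bin/nutch generate -topN 50   # select up to 50 URLs from the crawldb into a fetch batch
bin/nutch fetch -all          # fetch the generated batch
bin/nutch parse -all          # parse fetched content and extract outlinks
bin/nutch updatedb            # write new/updated URLs back to the crawldb (newer releases may want a batch id or -all)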
The Injector is typically the very first step; it adds the seed URLs to the crawldb, as stated here and here.
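For context, seeding and injecting in Nutch 2.x typically looks like the sketch below; the urls/ directory name is just an assumed example:

# create a seed list, one URL per line ("urls" is an arbitrary directory name)
mkdir -p urls
echo "http://www.cnn.com/" > urls/seed.txt
# inject the seeds into the crawldb
bin/nutch inject urls/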
Which I reckon you have already done, i.e. seeded cnn.com.
generate.max.count limits the number of URLs to be fetched from a single domain, as stated here.
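Note that generate.max.count works together with generate.count.mode, which controls whether the limit is counted per host or per domain (per host by default, if I remember right). A hedged nutch-site.xml sketch:

<property>
<name>generate.max.count</name>
<value>10</value>
<description>Maximum number of URLs per host/domain in one generate batch; -1 means no limit.</description>
</property>
<property>
<name>generate.count.mode</name>
<value>domain</value>
<description>Count the generate.max.count limit per domain instead of per host.</description>
</property>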
Now what matters is how many URLs from cnn.com your crawldb has.
Option 1
If you have generate.max.count = 10 and you have seeded or injected more than 10 URLs into the crawldb, then on executing a crawl cycle Nutch should fetch no more than 10 URLs.
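For example, a seed list with more than 10 cnn.com URLs (the section paths below are purely illustrative) would let a single cycle reach the limit:

http://www.cnn.com/
http://www.cnn.com/world
http://www.cnn.com/politics
http://www.cnn.com/business
http://www.cnn.com/health
http://www.cnn.com/entertainment
http://www.cnn.com/travel
http://www.cnn.com/style
http://www.cnn.com/sport
http://www.cnn.com/tech
http://www.cnn.com/videos
http://www.cnn.com/us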
Option 2
If you have injected only one URL and performed only one crawl cycle, then on the first cycle you will get only one document processed, because only one URL was in your crawldb. Your crawldb is updated at the end of each crawl cycle with the outlinks parsed from the fetched pages. So on execution of your second crawl cycle, third crawl cycle, and so on, Nutch should process at most 10 URLs from the specific domain per cycle.
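The bundled crawl script can run several cycles in one go instead of stepping through them by hand. In Nutch 2.x its usage is roughly bin/crawl <seedDir> <crawlId> <solrUrl> <numberOfRounds> (argument order varies between releases), for example:

# run 3 crawl cycles, indexing into the Solr collection from your config
# "myCrawl" is an arbitrary crawl id (it prefixes the HBase table names)
bin/crawl urls/ myCrawl http://x.x.x.x:8983/solr/collection1 3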