
Question

What is a good crawler (spider) for HTML and XML documents, local or web-based, that works well in the Lucene/Solr solution space? It could be Java-based, but it does not have to be.

Recommended answer

In my opinion, this is a pretty significant gap, and it is holding back the widespread adoption of Solr. The new DataImportHandler is a good first step for importing structured data, but there is no good document ingestion pipeline for Solr. Nutch does work, but the integration between the Nutch crawler and Solr is somewhat clumsy.
I've tried every open-source crawler that I can find, and none of them integrates with Solr out of the box.
Keep an eye on OpenPipeline and Apache Tika.
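
To make the "document ingestion pipeline" gap concrete, here is a minimal sketch of the glue code that typically ends up sitting between a crawler and Solr: extract text from a fetched HTML page, then package it as a JSON update request. It is an illustration, not what the answer's author used. The core name `mycore`, the field names `id`, `title` and `content`, and the helper names are all assumptions, and the JSON update format requires a Solr version that accepts JSON on its `/update` handler.

```python
import json
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and stylesheet bodies
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def html_to_doc(url, html):
    """Turn one fetched page into a dict shaped like a Solr document.

    The field names are hypothetical; they must match your own schema.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"id": url, "title": parser.title, "content": " ".join(parser.chunks)}


def build_update_request(docs, solr_base="http://localhost:8983/solr/mycore"):
    """Build (but do not send) a JSON update request for a list of documents.

    solr_base is an assumed URL for a local core named 'mycore'.
    """
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        solr_base + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it to a running Solr instance. A real pipeline would also need fetching, link following, politeness rules, and handling for non-HTML formats, which is where tools such as Nutch and Apache Tika come in.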

