
Question

Just wondering if anyone knows of a web-scraping library that takes advantage of Scala's succinct syntax. So far I've found Chafe, but it seems poorly documented and poorly maintained. Has anyone out there done scraping with Scala and have advice to share? (I'm trying to integrate into an existing Scala framework rather than use a scraper written in, say, Python.)

Recommended answer

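First, there is a plethora of HTML scraping libraries on the JVM; all you need to do is pimp one of them (the "pimp my library" pattern, i.e. wrapping a Java type in a Scala implicit class to give it a more idiomatic API).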

The four I have used are:

  • HtmlUnit - simulates a browser, even running JavaScript
  • Jericho - preserves formatting, so it is ideal if you want to edit the HTML you scrape
  • NekoHtml
  • JSoup - might work

I have used Selenium, but never for scraping. There is a Scala wrapper around Selenium.

I would recommend pimping an existing Java library over relying on some half-baked Scala library.
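To make the "pimp my library" suggestion concrete, here is a minimal sketch of the pattern. So that it runs without extra dependencies, it enriches the JDK's built-in DOM classes rather than any of the libraries listed above, and it assumes the input is well-formed XHTML (real-world HTML usually is not, which is why a tolerant parser such as the ones above would be used in practice). The names DomSyntax, tags, attr, text and parse are made up for this illustration; the same wrapper shape should carry over to the element types of whichever Java library you pick.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Document, Element, NodeList}

// Hypothetical helper object: implicit classes that add Scala-friendly
// methods to the JDK DOM types without modifying them.
object DomSyntax {

  // Turn a Java NodeList into a Scala Vector holding only the Element nodes.
  implicit class RichNodeList(nodes: NodeList) {
    def elements: Vector[Element] =
      (0 until nodes.getLength)
        .map(i => nodes.item(i))
        .collect { case e: Element => e }
        .toVector
  }

  // Look up all elements with a given tag name, in document order.
  implicit class RichDocument(doc: Document) {
    def tags(name: String): Vector[Element] =
      doc.getElementsByTagName(name).elements
  }

  // Option-returning attribute access and trimmed text content.
  implicit class RichElement(el: Element) {
    def attr(name: String): Option[String] =
      if (el.hasAttribute(name)) Some(el.getAttribute(name)) else None

    def text: String = el.getTextContent.trim
  }

  // Assumption: the input is well-formed XHTML, because the JDK parser
  // is a strict XML parser and rejects typical tag-soup HTML.
  def parse(xhtml: String): Document = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    builder.parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")))
  }
}

object Demo extends App {
  import DomSyntax._

  val page =
    """<html><body><a href="/a">First</a><a>No link</a><a href="/b"> Second </a></body></html>"""

  // Collect (href, link text) pairs, skipping anchors without an href.
  val links = parse(page).tags("a").flatMap(a => a.attr("href").map(h => h -> a.text))

  links.foreach(println)
}
```

Once the implicits are imported, the scraping code reads like ordinary Scala collection code (flatMap, Option) while all the parsing work is still done by the underlying Java library.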
