Question
Just wondering if anyone knows of a web-scraping library that takes advantage of Scala's succinct syntax. So far, I've found Chafe, but it seems poorly documented and poorly maintained. I'm wondering if anyone out there has done scraping with Scala and has advice. (I'm trying to integrate into an existing Scala framework rather than use a scraper written in, say, Python.)
Recommended answer
First, there is a plethora of HTML scraping libraries on the JVM; all you need to do is pimp one of them (the "pimp my library" pattern).
The four I have used are:
- HtmlUnit: emulates a browser and even runs JavaScript
- Jericho: preserves formatting, which makes it ideal if you want to edit the scraped HTML
- NekoHtml
- JSoup: might work
I have used Selenium, but never for scraping. There is a Scala wrapper around Selenium.
I would recommend pimping an existing Java library over using some half-baked Scala library.
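To illustrate what pimping a Java library might look like, here is a minimal sketch that wraps JSoup with an implicit class. It assumes JSoup (org.jsoup) is on the classpath and Scala 2.13 or later (for scala.jdk.CollectionConverters). The names JsoupSyntax, RichElement, texts and attrs are hypothetical helpers made up for this example, not part of JSoup or of any existing Scala library.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Element
import scala.jdk.CollectionConverters._ // Scala 2.13+; assumption of this sketch

// Hypothetical enrichment: adds Scala-friendly helpers to JSoup's Element.
// Document extends Element, so the helpers are available on parsed documents too.
object JsoupSyntax {
  implicit class RichElement(val underlying: Element) extends AnyVal {
    // Run a CSS selector and return the matches as an immutable Scala List.
    def >>(css: String): List[Element] =
      underlying.select(css).asScala.toList

    // Text content of every element matching the selector.
    def texts(css: String): List[String] =
      (underlying >> css).map(_.text())

    // Value of the given attribute on every element matching the selector.
    def attrs(css: String, name: String): List[String] =
      (underlying >> css).map(_.attr(name))
  }
}

object Demo extends App {
  import JsoupSyntax._

  // Inline HTML keeps the example offline; Jsoup.connect(url).get() would fetch a live page.
  val html =
    """<html><body>
      |  <h1>Links</h1>
      |  <a href="/a">First</a>
      |  <a href="/b">Second</a>
      |</body></html>""".stripMargin

  val doc = Jsoup.parse(html)

  println(doc.texts("a"))         // link texts
  println(doc.attrs("a", "href")) // link targets
}
```

The same approach should carry over to HtmlUnit, Jericho or NekoHtml: the Java library keeps doing the parsing, and the implicit class only adds the concise syntax on top.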