Web scraping with Java

Problem description

I can't find a good Java-based web scraping API. The site I need to scrape doesn't provide an API either; I want to iterate over all of its pages using some pageID and extract the HTML titles and other content from their DOM trees.

Are there ways other than web scraping?

Recommended answer

jsoup

Extracting the title is not difficult, and you have many options; search Stack Overflow for "Java HTML parsers". One of them is jsoup.

If you know the page structure, you can navigate the page using the DOM; see http://jsoup.org/cookbook/extracting-data/dom-navigation

It's a good library, and I've used it in my recent projects.
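As a rough illustration of the approach described above, here is a minimal sketch that uses jsoup to read a page's title and select elements from its DOM, then loops over a few page IDs. It assumes jsoup is on the classpath. The URL template, the page ID range, and the class and method names (`PageScraper`, `extractTitle`, `extractTexts`) are hypothetical and not taken from the question; adjust them to the real site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageScraper {

    // Hypothetical URL pattern (assumption): replace with the real site's pageID scheme.
    private static final String URL_TEMPLATE = "https://example.com/page?id=%d";

    /** Parses an HTML string and returns the text of its <title> element. */
    static String extractTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    /** Returns the text of every element matching the given CSS selector. */
    static List<String> extractTexts(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element el : doc.select(cssSelector)) {
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        // Assumed range of page IDs, for illustration only.
        for (int pageId = 1; pageId <= 3; pageId++) {
            String url = String.format(URL_TEMPLATE, pageId);
            // Fetches the page over HTTP and parses it into a DOM.
            Document doc = Jsoup.connect(url).get();
            System.out.println(pageId + ": " + doc.title());
        }
    }
}
```

The parsing helpers take an HTML string so they can be tried without network access, for example `PageScraper.extractTexts(html, "h1")` to collect all top-level headings. When fetching real pages, it is worth checking the site's terms of use and pacing the requests.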
