java - Extracting HTML tags using Java

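I want to extract all the HTML tags that appear in a web page's source code. Is there a way to do this in Java, or an HTML parser that supports it?
I want to separate out all of the HTML tags.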

Best answer

Java ships with an XML parser whose methods resemble the DOM in JavaScript:

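// Requires javax.xml.parsers.*, org.w3c.dom.Document, org.xml.sax.InputSource, java.io.StringReader
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// parse(String) expects a URI, so wrap the markup held in the String html
Document doc = builder.parse(new InputSource(new StringReader(html)));
doc.getElementById("someId");      // null unless a DTD declares the attribute as an ID
doc.getElementsByTagName("div");   // all <div> elements
doc.getChildNodes();               // top-level nodes of the document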

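The document builder accepts several kinds of input: an InputStream, a File, a URI string, or an InputSource. Note that parse(String) treats its argument as a URI, so a raw html string has to be wrapped in an InputSource over a StringReader. Because this is an XML parser, it only handles well-formed markup (XHTML); ordinary HTML with unclosed tags will make it throw a SAXException.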
http://download.oracle.com/javase/1.5.0/docs/api/org/w3c/dom/Document.html
If you need more than that, the CyberNeko parser is also good.
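To address the original question of separating out every tag, here is a minimal sketch that uses the same built-in parser to collect the distinct element names in a document. It assumes the input is well-formed XHTML; the class name TagExtractor and the helper extractTagNames are hypothetical names chosen for illustration, not part of any library.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagExtractor {

    // Hypothetical helper: returns the distinct element names found in a
    // well-formed XHTML string, sorted alphabetically.
    static Set<String> extractTagNames(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the markup in an InputSource; parse(String) would treat it as a URI.
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // "*" is the DOM wildcard that matches every element in the document.
            NodeList all = doc.getElementsByTagName("*");

            Set<String> names = new TreeSet<>();
            for (int i = 0; i < all.getLength(); i++) {
                names.add(all.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            // Assumption: any parser failure is reported as bad input.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div><p>one</p></div><p>two</p></body></html>";
        System.out.println(extractTagNames(page)); // prints [body, div, html, p]
    }
}
```

A TreeSet is used so each tag name appears once and in a stable order. To keep every occurrence in document order instead, the names could be collected into a List. For real-world pages that are not well-formed, a lenient HTML parser such as the one mentioned above would be needed in place of the XML DocumentBuilder.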

Regarding java - Extracting HTML tags using Java, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/5375028/