Extracting tags from an HTML file using Jsoup
Question
I am doing a structural analysis of web documents. For this I need to extract only the structure of a web document (only the tags). I found an HTML parser for Java called Jsoup, but I don't know how to use it to extract the tags.
Example:
<html>
<head>
this is head
</head>
<body>
this is body
</body>
</html>
Output:
html,head,head,body,body,html
Recommended answer
This sounds like a depth-first traversal:
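import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDepthFirst {

    // Returns the comma-separated tag sequence for the whole document.
    private static String htmlTags(Document doc) {
        StringBuilder sb = new StringBuilder();
        htmlTags(doc.children(), sb);
        return sb.toString();
    }

    // Appends each element's name when it is entered and again when it is left.
    private static void htmlTags(Elements elements, StringBuilder sb) {
        for (Element el : elements) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(el.nodeName());             // opening tag
            htmlTags(el.children(), sb);          // descend into child elements
            sb.append(",").append(el.nodeName()); // closing tag
        }
    }

    public static void main(String... args) {
        String s = "<html><head>this is head </head><body>this is body</body></html>";
        Document doc = Jsoup.parse(s);
        System.out.println(htmlTags(doc));
    }
}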
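The way each name is emitted twice (once on the way in, once on the way out) can be looked at in isolation with the following sketch. It deliberately does not use jsoup: `TagNode`, `collect`, and `tagSequence` are hypothetical names introduced here for illustration, and the tree is built by hand rather than parsed. Collecting the names in a list and joining them at the end is one possible way to avoid the `sb.length() > 0` comma check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TagOrderSketch {

    // Hypothetical stand-in for a parsed element: a tag name plus child elements.
    static final class TagNode {
        final String name;
        final List<TagNode> children;

        TagNode(String name, TagNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Depth-first walk: record the name on entry, recurse, record it again on exit.
    static void collect(TagNode node, List<String> out) {
        out.add(node.name);                 // corresponds to the opening tag
        for (TagNode child : node.children) {
            collect(child, out);
        }
        out.add(node.name);                 // corresponds to the closing tag
    }

    // Joins the collected names with commas.
    static String tagSequence(TagNode root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return String.join(",", out);
    }

    public static void main(String[] args) {
        // Hand-built equivalent of <html><head></head><body></body></html>.
        TagNode html = new TagNode("html", new TagNode("head"), new TagNode("body"));
        System.out.println(tagSequence(html)); // prints html,head,head,body,body,html
    }
}
```

With a real document, the same list-and-join idea should carry over by replacing `TagNode` with jsoup's `Element` and iterating over `el.children()`.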
Another solution is to use jsoup's NodeVisitor, as follows:
SecondSolution ss = new SecondSolution();
doc.traverse(ss);
System.out.println(ss.sb.toString());
The SecondSolution class:
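// Nested inside the class that runs the traversal. Requires imports of
// org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.nodes.Node
// and org.jsoup.select.NodeVisitor.
public static class SecondSolution implements NodeVisitor {

    StringBuilder sb = new StringBuilder();

    // Called when a node is first visited: record the opening tag.
    @Override
    public void head(Node node, int depth) {
        // Skip text nodes and the Document root itself.
        if (node instanceof Element && !(node instanceof Document)) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(node.nodeName());
        }
    }

    // Called after all of the node's descendants: record the closing tag.
    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            sb.append(",").append(node.nodeName());
        }
    }
}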
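The `depth` argument is unused in the visitor above, but it can be handy for structural analysis. The sketch below illustrates the head/tail callback order and one possible use of `depth`, without depending on jsoup. `Tag`, `Visitor`, `traverse`, and `outline` are hypothetical names made up for this example, shaped only loosely like the head/tail pair above, and depth is counted from 0 at the root tag here, which is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VisitorDepthSketch {

    // Hypothetical element type: a tag name plus child elements.
    static final class Tag {
        final String name;
        final List<Tag> children;

        Tag(String name, Tag... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Hypothetical callback interface with a head/tail pair.
    interface Visitor {
        void head(Tag tag, int depth); // called when a tag is first reached
        void tail(Tag tag, int depth); // called after all of its children
    }

    // Depth-first driver: head, then children one level deeper, then tail.
    static void traverse(Tag tag, int depth, Visitor visitor) {
        visitor.head(tag, depth);
        for (Tag child : tag.children) {
            traverse(child, depth + 1, visitor);
        }
        visitor.tail(tag, depth);
    }

    // Builds one line per opening or closing tag, indented two spaces per level.
    static List<String> outline(Tag root) {
        final List<String> lines = new ArrayList<>();
        traverse(root, 0, new Visitor() {
            @Override
            public void head(Tag tag, int depth) {
                lines.add(indent(depth) + "<" + tag.name + ">");
            }

            @Override
            public void tail(Tag tag, int depth) {
                lines.add(indent(depth) + "</" + tag.name + ">");
            }
        });
        return lines;
    }

    // Returns two spaces per nesting level.
    static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Tag html = new Tag("html", new Tag("head"), new Tag("body"));
        for (String line : outline(html)) {
            System.out.println(line);
        }
    }
}
```

Run as is, this prints the six tags of the example document as an indented outline, with `<head>` and `<body>` nested one level under `<html>`.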