Problem description
I'm pretty much a beginner in programming, currently trying to build my first web scraper using JSoup. So far I am able to get the data that I want from a single page of my target site, but naturally I would like to somehow iterate over the entire site.
JSoup seems to offer some kind of traversor/visitor (what's the difference?) for that, yet I have absolutely no idea how to make it work. I know what trees and nodes are and I know the structure of my target site, but I don't know how to create (?) a traversor/visitor object (?) and let it run over my site. Could it be that there is some advanced Java/OO magic at work that I don't know of?
Unfortunately neither the Jsoup cookbook nor other threads seem to really cover the details, so if someone could nudge me in the right direction I'd be very thankful.
Recommended answer
The NodeTraversor will efficiently iterate through all nodes under and including a specified root node. It doesn't use recursion, so a large DOM won't cause a stack overflow.
The NodeVisitor (NV) is the companion of the NodeTraversor (NT). Each time NT enters a node, it calls the head method of the NV. Each time NT leaves a node, it calls the tail method of the NV.
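To make the head/tail order concrete, here is a minimal, self-contained sketch of the pattern. It is not jsoup's code: SimpleNode, Visitor, traverse and demo are hypothetical stand-ins written only for illustration, and the loop is just one way to walk a tree without recursion by following parent and sibling links.

```java
import java.util.ArrayList;
import java.util.List;

public class TraversalSketch {
    /** Hypothetical minimal tree node; jsoup's real Node class is much richer. */
    static final class SimpleNode {
        final String name;
        SimpleNode parent;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name) { this.name = name; }

        /** Appends a child and returns this node so calls can be chained. */
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        /** Returns the next sibling, or null if this is the last child or has no parent. */
        SimpleNode nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this) + 1;
            return i < parent.children.size() ? parent.children.get(i) : null;
        }
    }

    /** Stand-in for a visitor with the same head/tail shape described above. */
    interface Visitor {
        void head(SimpleNode node, int depth);
        void tail(SimpleNode node, int depth);
    }

    /** Depth-first walk driven by a loop and parent/sibling links, no recursion. */
    static void traverse(Visitor visitor, SimpleNode root) {
        SimpleNode node = root;
        int depth = 0;
        while (node != null) {
            visitor.head(node, depth);           // entering the node
            if (!node.children.isEmpty()) {
                node = node.children.get(0);     // descend to the first child
                depth++;
                continue;
            }
            // Leaf reached: leave nodes upward until one has a sibling left to visit.
            while (node != root && node.nextSibling() == null) {
                visitor.tail(node, depth);
                node = node.parent;
                depth--;
            }
            visitor.tail(node, depth);           // leaving the node
            if (node == root) break;             // never walk outside the given subtree
            node = node.nextSibling();
        }
    }

    /** Records the callback order for the tree div > (h1, p > b). */
    static List<String> demo() {
        SimpleNode root = new SimpleNode("div")
                .add(new SimpleNode("h1"))
                .add(new SimpleNode("p").add(new SimpleNode("b")));
        List<String> events = new ArrayList<>();
        traverse(new Visitor() {
            @Override public void head(SimpleNode node, int depth) {
                events.add("head " + node.name + " " + depth);
            }
            @Override public void tail(SimpleNode node, int depth) {
                events.add("tail " + node.name + " " + depth);
            }
        }, root);
        return events;
    }
}
```

Calling TraversalSketch.demo() yields head div 0, head h1 1, tail h1 1, head p 1, head b 2, tail b 2, tail p 1, tail div 0. Every node gets exactly one head call when it is entered and one tail call after all of its children have been visited.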
NT is ready-made and provided to you by the Jsoup API. All you have to do is supply NT with an NV implementation.
Here is a real-life implementation of NodeVisitor taken from the ElasticSearch source code:
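import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// Excerpt: both members below live inside an enclosing class.
// Note: the NodeTraversor constructor and instance traverse(...) belong to older
// jsoup releases; newer ones offer a static NodeTraversor.traverse(visitor, root).
protected static String convertElementsToText(Elements elements) {
    if (elements == null || elements.isEmpty())
        return "";
    StringBuilder buffer = new StringBuilder();
    // One traversor, reused for every root element; the visitor appends to buffer.
    NodeTraversor nt = new NodeTraversor(new ToTextNodeVisitor(buffer));
    for (Element element : elements) {
        nt.traverse(element);
    }
    return buffer.toString().trim();
}

private static final class ToTextNodeVisitor implements NodeVisitor {
    final StringBuilder buffer;

    ToTextNodeVisitor(StringBuilder buffer) {
        this.buffer = buffer;
    }

    // Called when the traversor enters a node: collect the text of text nodes only.
    @Override
    public void head(Node node, int depth) {
        if (node instanceof TextNode) {
            TextNode textNode = (TextNode) node;
            String text = textNode.text().replace('\u00A0', ' ').trim(); // non breaking space
            if (!text.isEmpty()) {
                buffer.append(text);
                if (!text.endsWith(" ")) {
                    buffer.append(" ");
                }
            }
        }
    }

    // Called when the traversor leaves a node: nothing to do here.
    @Override
    public void tail(Node node, int depth) {
    }
}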
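For comparison, a minimal sketch of the same idea against a more recent jsoup, assuming jsoup 1.11.1 or later is on the classpath, where (to my knowledge) traversal is exposed as the static NodeTraversor.traverse(visitor, root). The class name TextCollector and the method collectText are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class TextCollector {
    /** Parses the HTML and joins the text of all text nodes under <body> with single spaces. */
    static String collectText(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();

        // The visitor is supplied as an anonymous class; the traversor does the walking.
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        out.append(text).append(' ');
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        }, doc.body());

        return out.toString().trim();
    }
}
```

For instance, TextCollector.collectText("<p>Hello <b>world</b></p>") should return "Hello world". Fetching each page and following its links across a whole site is a separate job from walking the nodes of one parsed document; the traversor only covers the latter.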