How to build a NodeTraversor/NodeVisitor with JSoup?

Question

I'm pretty much a beginner in programming and am currently trying to build my first web scraper using JSoup. So far I can get the data I want from a single page of my target site, but naturally I would like to iterate over the entire site somehow.

JSoup seems to offer some kind of traversor/visitor (what's the difference?) for that, yet I have absolutely no idea how to make it work. I know what trees and nodes are, and I know the structure of my target site, but I don't know how to create (?) a traversor/visitor object (?) and let it run over my site. Could there be some advanced Java/OO magic at work that I don't know of?

Unfortunately, neither the Jsoup cookbook nor other threads seem to really cover the details, so if someone could nudge me in the right direction I'd be very thankful.

Recommended answer

The NodeTraversor efficiently iterates through all nodes under, and including, a specified root node. It doesn't use recursion, so a large DOM won't cause a stack overflow.

The NodeVisitor (NV) is the companion of the NodeTraversor (NT). Each time NT enters a node, it calls the head method of the NV. Each time NT leaves a node, it calls the tail method of the NV.
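To make the head/tail calling order and the "no recursion" point concrete, here is a small stand-alone model of the pattern. It is only a sketch: MiniNode, MiniVisitor, MiniTraversor and TraversalDemo are hypothetical names invented for this illustration, not jsoup classes, and the loop is a simplified version of the general technique rather than jsoup's actual source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature tree node; jsoup's real Node class is far richer.
class MiniNode {
    final String name;
    final List<MiniNode> children = new ArrayList<>();
    MiniNode parent;

    MiniNode(String name) { this.name = name; }

    // Appends a child and returns this node so calls can be chained.
    MiniNode add(MiniNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // Returns the sibling that follows this node, or null if there is none.
    MiniNode nextSibling() {
        if (parent == null) return null;
        int i = parent.children.indexOf(this);
        return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
    }
}

// Same shape as the head/tail contract described above.
interface MiniVisitor {
    void head(MiniNode node, int depth);
    void tail(MiniNode node, int depth);
}

class MiniTraversor {
    // Depth-first walk using a loop and parent/sibling links instead of recursion,
    // so the call stack stays flat no matter how deep the tree is.
    static void traverse(MiniVisitor visitor, MiniNode root) {
        MiniNode node = root;
        int depth = 0;
        while (node != null) {
            visitor.head(node, depth);                 // entering the node
            if (!node.children.isEmpty()) {
                node = node.children.get(0);           // go down to the first child
                depth++;
            } else {
                // Climb up while there is no sibling left to visit.
                while (node.nextSibling() == null && depth > 0) {
                    visitor.tail(node, depth);         // leaving the node
                    node = node.parent;
                    depth--;
                }
                visitor.tail(node, depth);
                if (node == root) break;               // whole subtree done
                node = node.nextSibling();
            }
        }
    }
}

class TraversalDemo {
    // Records every head/tail call as "head <name> <depth>" or "tail <name> <depth>".
    static List<String> trace(MiniNode root) {
        List<String> events = new ArrayList<>();
        MiniTraversor.traverse(new MiniVisitor() {
            @Override public void head(MiniNode n, int d) { events.add("head " + n.name + " " + d); }
            @Override public void tail(MiniNode n, int d) { events.add("tail " + n.name + " " + d); }
        }, root);
        return events;
    }

    public static void main(String[] args) {
        MiniNode body = new MiniNode("body").add(new MiniNode("p")).add(new MiniNode("a"));
        MiniNode html = new MiniNode("html").add(body);
        trace(html).forEach(System.out::println);
    }
}
```

For every node, head fires before any of its children are visited and tail fires after all of them, which is why a visitor can use head to open something and tail to close it.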

NT is ready-made and provided to you by the Jsoup API. All you have to do is provide NT with an NV implementation.

Here is a real-life implementation of NodeVisitor, taken from the ElasticSearch source code:

protected static String convertElementsToText(Elements elements) {
    if (elements == null || elements.isEmpty())
      return "";
    StringBuilder buffer = new StringBuilder();
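    // Instance-style API of older jsoup releases; newer releases expose a
    // static NodeTraversor.traverse(visitor, root) instead.
    NodeTraversor nt = new NodeTraversor(new ToTextNodeVisitor(buffer));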
    for (Element element : elements) {
      nt.traverse(element);
    }
    return buffer.toString().trim();
}

private static final class ToTextNodeVisitor implements NodeVisitor {
    final StringBuilder buffer;

    ToTextNodeVisitor(StringBuilder buffer) {
      this.buffer = buffer;
    }

    @Override
    public void head(Node node, int depth) {
      if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        String text = textNode.text().replace('\u00A0', ' ').trim(); // non breaking space
        if (!text.isEmpty()) {
          buffer.append(text);
          if (!text.endsWith(" ")) {
            buffer.append(" ");
          }
        }
      }
    }

    @Override
    public void tail(Node node, int depth) {
      // Nothing to do when the traversor leaves a node.
    }
}
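For a version you can run end to end, the following sketch applies the same idea to a small HTML string. It assumes a jsoup release on the classpath in which NodeTraversor.traverse(visitor, root) is a static method (the instance form above belongs to older releases). TextCollector and collectText are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class for illustration; not part of jsoup or ElasticSearch.
class TextCollector {

    // Parses the HTML and joins the non-empty text nodes with single spaces.
    static String collectText(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder buffer = new StringBuilder();

        // Assumption: static traverse(NodeVisitor, Node), as in newer jsoup releases.
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when the traversor enters a node; only text nodes matter here.
                if (node instanceof TextNode) {
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        buffer.append(text).append(' ');
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        }, doc);

        return buffer.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(collectText("<div><p>Hello</p><p>world</p></div>"));
    }
}
```

Both callbacks are implemented explicitly so the sketch also compiles against releases where tail is not yet a default method.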

