Extracting loosely structured Wikipedia text (HTML)



Some of the HTML on Wikipedia disambiguation pages is, shall we say, ambiguous: the links that point to specific people named Corzine are difficult to capture with jsoup because they are not explicitly structured, nor do they live in a particular section as in this example. See the Corzine page here.

How can I get a hold of them? Is jsoup a suitable tool for this task?


Perhaps I should use regex, but I fear doing that because I want it to be generalizable.

</b> may refer to:</p>
  <li><a href


^ This part is standard; maybe I could use a regex to match it?

<p><b>Corzine</b> may refer to:</p>
<ul>
  <li><a href="/wiki/Dave_Corzine" title="Dave Corzine">Dave Corzine</a> (born 1956), basketball player</li>
  <li><a href="/wiki/Jon_Corzine" title="Jon Corzine">Jon Corzine</a> (born 1947), former CEO of <a href="/wiki/MF_Global" title="MF Global">MF Global</a>, former Governor on New Jersey, former CEO of <a href="/wiki/Goldman_Sachs" title="Goldman Sachs">Goldman Sachs</a></li>
</ul>
 <table id="setindexbox" class="metadata plainlinks dmbox dmbox-setindex" style="" role="presentation">


Dave Corzine
Jon Corzine

Maybe it would be possible to match the section </b> may refer to:</p> and also <table id="setindexbox", and extract everything in between. I guess <table id="setindexbox" could be matched easily enough in jsoup, but </b> may refer to:</p> should be more difficult because <b> and <p> are not very distinctive.


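      // docx is the parsed jsoup Document; inputLine holds the raw HTML text
      Elements lists = docx.select("ul");
      Elements items = lists.select("li");

      Pattern ppp = Pattern.compile("table id=\"setindexbox\" ");
      Matcher mmm = ppp.matcher(inputLine);

      Pattern pp = Pattern.compile("</b> may refer to:</p>");
      Matcher mm = pp.matcher(inputLine);
      // find() searches anywhere in the input; matches() requires the whole input to match
      if (mm.find()) {
          for (Element item : items) {
              // href lives on the <a> inside the <li>, not on the <li> itself
              Element link = item.select("a").first();
              if (link == null) continue;
              String url = link.attr("href");
              String text = link.text();
              System.out.println(text + ", " + url);
          }
      }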




Elements els = doc.select("p ~ ul a:eq(0)");

See: http://try.jsoup.org/~yPvgR0pxvA3oWQSJte4Rfm-lS2Y

That's looking for the first A element (a:eq(0)) in a ul that's a sibling of a p. You could also do p:contains(corzine) ~ ul a:eq(0) if there were other conflicts.

Or, more generally: :contains(may refer to) ~ ul a:eq(0)
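A minimal sketch of the selector approach, assuming jsoup is on the classpath. The class name `DisambiguationLinks`, the helper `firstLinks`, and the inline HTML (a trimmed copy of the markup from the question, with a surrounding `<ul>` assumed) are illustrative and not part of the original answer. It uses the slightly stricter `ul > li > a:eq(0)` so that only direct list items are considered.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class DisambiguationLinks {

    // Returns "text, href" for the leading link of each list item that
    // follows a paragraph whose text contains "may refer to".
    static List<String> firstLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element a : doc.select("p:contains(may refer to) ~ ul > li > a:eq(0)")) {
            out.add(a.text() + ", " + a.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample input modelled on the Corzine page excerpt.
        String html = "<p><b>Corzine</b> may refer to:</p>"
                + "<ul>"
                + "<li><a href=\"/wiki/Dave_Corzine\">Dave Corzine</a> (born 1956), basketball player</li>"
                + "<li><a href=\"/wiki/Jon_Corzine\">Jon Corzine</a> (born 1947), former CEO of "
                + "<a href=\"/wiki/MF_Global\">MF Global</a></li>"
                + "</ul>"
                + "<table id=\"setindexbox\"></table>";
        for (String line : firstLinks(html)) {
            System.out.println(line);
        }
    }
}
```

Note that `a:eq(0)` only matches when the link is the first child element of its `li`; an entry that begins with some other element (italic text, for instance) would be skipped, so this is a starting point rather than a general solution.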


It's hard to generalize Wikipedia because it's unstructured. But IMHO it's easier to use a parser and CSS selectors than regexes, particularly over time when templates change etc.
