I'm trying to grab a table from the following webpage:
http://www.bloomberg.com/markets/companies/country/hong-kong/
I have some sample code which was kindly provided by Phil Bozak here:
grabbing table from html using Google script
which grabs the table for this website:
http://www.airchina.com.cn/www/en/html/index/ir/traffic/
As you can see from Phil's code, there are a lot of getElement() calls in it. When I look at the HTML for the Air China website, the table appears to be nested four levels deep. Is that why there is a string of .getElement() calls?
Now I look at the source code for the Bloomberg page, and it is loaded with "div" elements...
The question is: can someone show me how to grab the table from the Bloomberg page?
A brief explanation of the theory would also be useful. Thanks a bunch.
Let's flip your question upside down, and start with the theory. Methodology might be a better word for it.
You want to get at something specific in a structured page. To do that, you either need a way to zap right to the element (which can be done if it's labeled in a unique way that we can access), OR you need to navigate the structure more-or-less manually. You already know how to look at the source of a page, so you're familiar with this step. Here's a screenshot of Firefox Inspector, highlighting the element we're interested in.
We can see the hierarchy of elements that leads to the table: html, body, div, div, div.ticker, table.ticker_data. We can also see the source:
<table class="ticker_data">
Neat! It's labeled! Unfortunately, that class info gets dropped when we process the HTML in our script. Bummer. If it was id="ticker_data" instead, we could use the getElementByVal() utility from this answer to reach it, and give ourselves some immunity from future restructuring of the page. Put a pin in that - we'll come back to it.
It can help to visualize this in the debugger. Here's a utility script for that - run it in debug mode, and you'll have your HTML document laid out to explore:
/**
 * Debug-run this in the editor to be able to explore the structure of web pages.
 *
 * Set target to the page you're interested in.
 */
function pageExplorer() {
  var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
  var pageTxt = UrlFetchApp.fetch(target).getContentText();
  var pageDoc = Xml.parse(pageTxt, true);
  debugger;  // Pause in debugger - explore pageDoc
}
This is what our page looks like in the debugger:
You might be wondering what the numbered elements are, since you don't see them in the source. When there are multiples of an element type at the same level in an XML document, the parser presents them as an array, numbered 0..n. Thus, when we see 0 under a div in the debugger, that's telling us that there are multiple <div> tags in the HTML source at that level, and we can access them as an array, for example .div[0].
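To make the array behaviour concrete, here is a small sketch that uses plain JavaScript objects as a stand-in for the parsed document. The object shape and the helper name childrenOf are assumptions made for illustration only; they are not part of the Xml service.

```javascript
// Hypothetical stand-in for a parsed page: a repeated tag shows up as an
// array, while a tag that appears only once stays a single object.
var mockBody = {
  div: [
    { id: "header" },           // reachable as mockBody.div[0]
    { id: "footer" }            // reachable as mockBody.div[1]
  ],
  table: { id: "only-table" }   // a single <table>, so no array
};

// Hypothetical helper: always hand back the children as an array, so the
// caller does not need to care whether there was one element or several.
function childrenOf(node, tagName) {
  var child = node[tagName];
  if (child === undefined || child === null) return [];
  return Array.isArray(child) ? child : [child];
}

var divs = childrenOf(mockBody, "div");      // two entries
var tables = childrenOf(mockBody, "table");  // one entry
var spans = childrenOf(mockBody, "span");    // empty array
```

Normalizing to an array like this is one way to avoid a surprise when a page that used to have several sibling divs is changed to have only one.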
Ok, theory behind us, let's go ahead and see how we can access the table by brute-force.
Knowing the hierarchy, including the div arrays shown in the debugger, we could do this, à la Phil's previous answer. I'll do some weird indenting to illustrate the document structure:
...
  var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
  var pageTxt = UrlFetchApp.fetch(target).getContentText();
  var pageDoc = Xml.parse(pageTxt, true);
  var table = pageDoc.getElement()
                     .getElement("body")
                       .getElements("div")[0]       // 0-th div under body, shown in debugger
                         .getElements("div")[5]     // 5-th div under there
                           .getElement("div")       // another div
                             .getElement("table");  // finally, our table
As a much more compact alternative to all those .getElement() calls, we can navigate using dot notation.
var table = pageDoc.getElement().body.div[0].div[5].div.table;
And that's that.
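A hard-coded chain like this throws as soon as one level is missing. As a rough sketch of a more forgiving approach, the helper below walks a list of steps and returns null when the path breaks. It runs against a mock tree of plain objects shaped like the debugger view; walkPath, mockHtml and the step format are all assumptions for illustration, not part of Apps Script.

```javascript
// Hypothetical helper: follow a list of steps from a root node and return
// null instead of throwing if any step is missing.
// A step is either a tag name ("body") or a [tagName, index] pair (["div", 5]).
function walkPath(root, steps) {
  var node = root;
  for (var i = 0; i < steps.length; i++) {
    if (node === undefined || node === null) return null;
    var step = steps[i];
    if (Array.isArray(step)) {
      var list = node[step[0]];
      node = Array.isArray(list) ? list[step[1]] : undefined;
    } else {
      node = node[step];
    }
  }
  return node === undefined ? null : node;
}

// Mock document shaped like the hierarchy above (an assumption for this sketch).
var mockHtml = {
  body: {
    div: [
      { div: [{}, {}, {}, {}, {}, { div: { table: { name: "ticker_data" } } }] },
      {}
    ]
  }
};

// Same route as body.div[0].div[5].div.table
var found = walkPath(mockHtml, ["body", ["div", 0], ["div", 5], "div", "table"]);

// A route that does not exist yields null rather than an exception
var missing = walkPath(mockHtml, ["body", ["div", 3], "div", "table"]);
```

The path is data rather than code here, so if the page is restructured, only the step list needs to change.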
Let's go back to that pinned idea. In the debugger, we can see that there are various attributes attached to elements. In particular, there's an "id" on that div[5] that contains the div that contains the table. Remember, in the source we saw "class" attributes, but note that they don't make it this far.
Still, the fact that a kindly programmer put this "id" in place means we can do this, with getDivById() from that earlier question:
var contentDiv = getDivById( pageDoc.getElement().body, 'content' );
var table = contentDiv.div.table;
If they move things around, we might still be able to find that table, without changing our code.
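The getDivById() helper itself lives in the earlier question and is not reproduced here. Purely to illustrate the idea behind it, here is a sketch of a depth-first search by id over a mock tree of plain objects. The name findDivById, the mock data and the node shape (an optional id string plus a div property holding one child or an array of children) are assumptions, not the original implementation.

```javascript
// Hypothetical sketch: depth-first search for a div with a given id.
// Each mock node may carry an "id" string and a "div" property that is
// either a single child object or an array of children.
function findDivById(node, id) {
  if (!node || typeof node !== "object") return null;
  if (node.id === id) return node;
  var kids = node.div;
  if (kids === undefined || kids === null) return null;
  if (!Array.isArray(kids)) kids = [kids];
  for (var i = 0; i < kids.length; i++) {
    var hit = findDivById(kids[i], id);
    if (hit) return hit;
  }
  return null;
}

// Mock body (an assumption): the "content" div is nested at an arbitrary depth.
var mockBodyWithIds = {
  div: [
    { div: [{ id: "nav" }, { id: "content", div: { table: { name: "ticker_data" } } }] },
    { id: "footer" }
  ]
};

var contentDivSketch = findDivById(mockBodyWithIds, "content");
var tableSketch = contentDivSketch ? contentDivSketch.div.table : null;
var notThere = findDivById(mockBodyWithIds, "sidebar");  // no such id, so null
```

Because the search does not depend on the position of the div, the lookup keeps working if the "content" div moves elsewhere in the tree, as long as the id stays the same.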
You already know what to do once you have the table element, so we're done here!