Need help parsing this HTML table with BeautifulSoup and lxml, the Pythonic way

Problem description

I have searched a lot about BeautifulSoup, and some answers suggest lxml as the future of BeautifulSoup. While that makes sense, I am having a tough time parsing the following table out of a whole list of tables on the web page.

I am interested in the three columns; the number of rows varies depending on the page and the time it was checked. A BeautifulSoup and lxml solution would be much appreciated. That way I can ask the admin to install lxml on the dev machine.

Desired output:

Website                    Last Visited          Last Loaded
http://google.com          01/14/2011
http://stackoverflow.com   01/10/2011
...... more if present

The following is a code sample from a messy web page:

                   <table border="2" width="100%">
                      <tbody><tr>
                        <td width="33%" class="BoldTD">Website</td>
                        <td width="33%" class="BoldTD">Last Visited</td>
                        <td width="34%" class="BoldTD">Last Loaded</td>
                      </tr>
                      <tr>
                        <td width="33%">
                          <a href="http://google.com"</a>
                        </td>
                        <td width="33%">01/14/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                      <tr>
                        <td width="33%">
                          <a href="http://stackoverflow.com"</a>
                        </td>
                        <td width="33%">01/10/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                    </tbody></table>

Solution

Here's a version that uses HTMLParser. I tested it against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tag and doctype declaration, both of which foiled the ElementTree version.

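# Ported to Python 3: the module is html.parser (it was HTMLParser in Python 2).
from html.parser import HTMLParser

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.line = ""          # text collected for the current row
    self.in_tr = False      # True while inside a <tr> of the target table
    self.in_table = False   # True once the "Website" header has been seen

  def handle_starttag(self, tag, attrs):
    if self.in_table and tag == "tr":
      self.line = ""
      self.in_tr = True
    if tag == 'a' and self.in_tr:
      # The links have no text, so take the URL from the href attribute.
      for attr in attrs:
        if attr[0] == 'href':
          self.line += attr[1] + " "

  def handle_endtag(self, tag):
    if tag == 'tr':
      # Print only rows collected inside the target table, then reset
      # so that a stale line is not printed again by a later </tr>.
      if self.in_tr and self.line:
        print(self.line)
      self.in_tr = False
      self.line = ""
    elif tag == "table":
      self.in_table = False

  def handle_data(self, data):
    # The "Website" header cell marks the table we want.
    if data == "Website":
      self.in_table = True
    elif self.in_tr:
      data = data.strip()
      if data:
        self.line += data + " "

if __name__ == '__main__':
  myp = MyParser()
  with open('table.html') as f:
    myp.feed(f.read())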

Hopefully this addresses everything you need, and you can accept this as the answer. Updated as requested.
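
As a variation on the HTMLParser approach above, here is a minimal sketch that returns the rows as lists instead of printing them, so that an empty "Last Loaded" cell stays in its column. It assumes Python 3 (where the module is html.parser) and, like the answer, that the target table is the one whose first header cell reads "Website". The names TableRows and parse_rows are hypothetical, chosen for illustration. The demo uses a well-formed anchor; how the parser recovers from the malformed `<a href="..."</a>` in the question's sample depends on the parser version and has not been verified here.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the data rows of the table whose first header cell is 'Website'."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text fragments of the cell being read, or None
        self._in_table = False  # True after the 'Website' header row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "a" and self._cell is not None:
            # The sample links carry no text, so use the href as the cell value.
            href = dict(attrs).get("href")
            if href:
                self._cell.append(href)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the non-empty fragments; an empty cell becomes "".
            self._row.append(" ".join(part for part in self._cell if part))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row and self._row[0] == "Website":
                self._in_table = True           # header row of the target table
            elif self._in_table and self._row:
                self.rows.append(self._row)     # data row of the target table
            self._row = None
            self._cell = None
        elif tag == "table":
            self._in_table = False


def parse_rows(html):
    """Return the data rows of the 'Website' table as lists of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    demo = (
        "<table><tr><td>Website</td><td>Last Visited</td><td>Last Loaded</td></tr>"
        '<tr><td><a href="http://google.com"></a></td><td>01/14/2011</td><td></td></tr>'
        "</table>"
    )
    for row in parse_rows(demo):
        print("\t".join(row))
```

To run it against a saved page, read the file and pass its contents to parse_rows. Each returned row has one entry per `<td>`, so the three columns can be formatted however the output needs.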
