Combining urllib with ElementTree

Question

I'm having a few problems parsing simple HTML with the ElementTree module from the Python standard library. This is my source code:

from urllib.request import urlopen
from xml.etree.ElementTree import ElementTree

import sys

def main():
    site = urlopen("http://1gabba.in/genre/hardstyle")
    try:
        html = site.read().decode('utf-8')
        xml = ElementTree(html)
        print(xml)
        print(xml.findall("a"))
    except:
        print(sys.exc_info())

if __name__ == '__main__':
    main()

This fails, and I get the following output on my console:

<xml.etree.ElementTree.ElementTree object at 0x00000000027D14E0>
(<class 'AttributeError'>, AttributeError("'str' object has no attribute 'findall'",), <traceback object at 0x0000000002910B88>)

So xml is indeed an ElementTree object, and the documentation shows that the ElementTree class has a findall function. One more oddity: xml.find("a") works fine, but it returns an int instead of an Element instance.

So could anybody help me out? What am I misunderstanding?

Recommended answer

Replace ElementTree(html) with ElementTree.fromstring(html), and change your import statement to from xml.etree import ElementTree.

The problem here is that the ElementTree constructor doesn't expect a string as its input; it expects an Element object. The function xml.etree.ElementTree.fromstring() is the easiest way to parse a string; note that it returns the root Element rather than an ElementTree wrapper.
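As a minimal sketch of the fix, the snippet below parses a small well-formed string with fromstring(). The sample markup is a hypothetical stand-in for the downloaded page, not the real site content. It also appears to explain the int mentioned in the question: in the Python version shown there, the constructor seems to have stored the string as the tree's root, so xml.find("a") ended up calling str.find, which returns an index.

```python
from xml.etree import ElementTree

# Hypothetical well-formed sample standing in for the downloaded page.
html = "<html><body><a href='/one'>One</a><p><a href='/two'>Two</a></p></body></html>"

# fromstring() parses the text and returns the root Element.
root = ElementTree.fromstring(html)

# findall("a") only matches direct children of the root, so it finds nothing here.
direct = root.findall("a")

# ".//a" searches all descendants of the root.
links = [a.get("href") for a in root.findall(".//a")]

print(root.tag, direct, links)
```

Note that this only works when the input is well-formed XML; real-world HTML will often make fromstring() raise a ParseError.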

I'm guessing that an XML parser isn't what you really want for this task, given that you're parsing HTML (which is not necessarily valid XML). You might want to take a look at:

  • http://www.boddie.org.uk/python/HTML.html
  • Parsing HTML in Python
  • http://www.crummy.com/software/BeautifulSoup/
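For comparison, here is a rough sketch of the same link extraction using html.parser from the standard library. This is not one of the libraries linked above, just a dependency-free illustration of a parser that tolerates markup an XML parser would reject. The class name LinkCollector and the sample string are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, tolerating non-XML markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical sample: the unclosed <br> and <p> tags are not well-formed XML.
sample = (
    '<p>Genres<br><a href="/genre/hardstyle">Hardstyle</a>'
    '<p><a href="/genre/hardcore">Hardcore</a>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

To use it on a real page, feed it the decoded text returned by urlopen(...).read().decode(...) instead of the sample string.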

07-22 08:59