This article explains how to decode HTML entities in a Python string.

Problem description

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Solution

Python 3.4+

HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9. Instead, use html.unescape():

import html
print(html.unescape('&pound;682m'))

See https://docs.python.org/3/library/html.html#html.unescape
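
As a small sketch of what html.unescape() accepts, the snippet below decodes the same pound sign written as a named entity, a decimal character reference, and a hexadecimal character reference. The sample strings and the names samples and decoded are made up for this illustration and are not part of the original question.

```python
import html

# Illustrative inputs (assumed for this example): three spellings of the pound sign.
samples = [
    "&pound;682m",  # named entity
    "&#163;682m",   # decimal character reference
    "&#xA3;682m",   # hexadecimal character reference
]

# html.unescape() converts each reference to the corresponding Unicode character.
decoded = [html.unescape(s) for s in samples]
print(decoded)
```

All three inputs should come out as the same string, so the form in which the entity arrives from the parser should not matter.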


Python 2.6-3.3

You can use the HTML parser from the standard library:

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

See http://docs.python.org/2/library/htmlparser.html

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
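
Because HTMLParser.unescape is deprecated, code that has to run on both old and new interpreters could prefer html.unescape() and fall back to the parser method only when the html module does not provide it. The following is one possible arrangement, not the only one; binding the result to a module-level name unescape is a choice made for this sketch.

```python
try:
    # Python 3.4+: the supported function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Fall back to the (deprecated) parser method on old interpreters only.
    unescape = HTMLParser().unescape

print(unescape("&pound;682m"))
```

Callers then use unescape(text) everywhere and never touch the parser class directly.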


07-26 13:30