This article covers how to decode HTML entities in a Python string.

Problem description

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string
>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Solution

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))
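As a slightly fuller sketch, html.unescape() also handles decimal and hexadecimal character references, not just named entities. The sample strings below are illustrative assumptions, not taken from the question:

```python
import html

# Illustrative inputs: a named entity, a decimal reference,
# a hexadecimal reference, and an escaped ampersand.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m", "Fish &amp; chips"]

# Decode each string; the first three all yield the pound sign.
decoded = [html.unescape(s) for s in samples]

for original, result in zip(samples, decoded):
    print(original, "->", result)
```

Strings that contain no entities pass through unchanged, so it should be safe to call this on text that may or may not be escaped.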

Note that html.parser.HTMLParser.unescape has been deprecated since Python 3.4. It was supposed to be removed in 3.5 but was left in by mistake, and it was finally removed in Python 3.9, so use html.unescape() on any current Python 3.


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
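If one code base has to run on all of the versions mentioned above, the two approaches can be combined. The following is only a sketch: the module-level name unescape is a choice made here for illustration. It prefers html.unescape() and falls back to the HTMLParser method only where the newer function is missing:

```python
try:
    # Python 3.4+: the supported function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bind the legacy method to the same name (illustrative choice)
    unescape = HTMLParser().unescape

print(unescape('&pound;682m'))
```

Trying html.unescape first matters because HTMLParser.unescape no longer exists on Python 3.9 and later, so the fallback branch must only run on older interpreters.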


07-26 13:30