Problem description
I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string
>>> print text
&pound;682m
How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?
Python 3.4+
Use html.unescape():
import html
print(html.unescape('&pound;682m'))
FYI: html.parser.HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9.
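As a small illustration of html.unescape() on Python 3.4+, the sketch below decodes the same character written as a named, a decimal, and a hexadecimal entity. The sample strings are made up for this example, and the pound sign is written as '\u00a3' only to keep the snippet ASCII-safe.

```python
import html

# Three spellings of the same entity: named, decimal, and hexadecimal.
samples = ['&pound;682m', '&#163;682m', '&#xa3;682m']
decoded = [html.unescape(s) for s in samples]

# Each spelling should decode to the same string, '\u00a3682m' (i.e. "£682m").
print(all(d == '\u00a3682m' for d in decoded))
```

Text without entities passes through unchanged, so calling html.unescape() on already-decoded strings should be harmless.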
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
- For Python 2.6-2.7 it's in HTMLParser
- For Python 3 it's in html.parser
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
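If the same code has to run across all of the versions discussed above, the two approaches could be combined into one helper. This is only a sketch: the name unescape_entities is hypothetical (not part of any library), and it assumes html.unescape should be preferred whenever it is available, with HTMLParser.unescape() used only as a fallback on older interpreters.

```python
# Hypothetical helper: pick whichever unescape implementation this interpreter offers.
try:
    from html import unescape as _unescape  # Python 3.4+
except ImportError:
    try:
        from HTMLParser import HTMLParser  # Python 2.6-2.7
    except ImportError:
        from html.parser import HTMLParser  # Python 3.0-3.3
    _unescape = HTMLParser().unescape


def unescape_entities(text):
    """Return text with named and numeric HTML entities decoded."""
    return _unescape(text)


# u'\u00a3682m' is "£682m", written with an escape to stay ASCII-safe.
print(unescape_entities('&pound;682m') == u'\u00a3682m')
```

Trying html.unescape first means the deprecated method is never touched on interpreters where it has been removed.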