Problem description
I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string
>>> print text
&pound;682m
How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?
Python 3.4+
HTMLParser.unescape is deprecated. It was supposed to be removed in 3.5 but was left in by mistake, and it was finally removed in Python 3.9. Use html.unescape() instead:
import html
print(html.unescape('&pound;682m'))
See https://docs.python.org/3/library/html.html#html.unescape
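As a quick illustration of what html.unescape() covers, the sketch below decodes a named, a decimal and a hexadecimal reference to the same character. The sample strings and the names `samples`, `decoded` and `plain` are made up for this example and are not part of the original answer.

```python
import html

# Three spellings of the pound sign: named, decimal and hexadecimal references.
samples = ['&pound;682m', '&#163;682m', '&#xA3;682m']

# html.unescape() replaces each character reference with the character it names.
decoded = [html.unescape(s) for s in samples]
print(decoded)

# Text without any entities passes through unchanged.
plain = html.unescape('682m')
print(plain)
```

All three forms should come out as the same string, so there is usually no need to treat numeric references separately.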
Python 2.6-3.3
You can use the HTML parser from the standard library:
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
See http://docs.python.org/2/library/htmlparser.html
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
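If one code base has to run on every version mentioned above, the two approaches can be combined into a single fallback chain. This is only a sketch: the helper name `decode_entities` is hypothetical, and the chain assumes that html.unescape() is preferred whenever it exists, with HTMLParser.unescape used only on older interpreters.

```python
try:
    # Python 3.4+: module-level function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3: html exists, but html.unescape does not
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Hypothetical helper: decode HTML character references in text."""
    return unescape(text)


print(decode_entities('&pound;682m'))
```

Trying html.unescape() first keeps the code working on Python 3.9 and later, where HTMLParser.unescape no longer exists.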