Problem description
I am processing HTML using Python and the BeautifulSoup 4 library, and I can't find an obvious way to replace &nbsp; with a space. Instead, it seems to be converted to a Unicode non-breaking space character.
Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?
Edit to add: I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from Beautiful Soup 3 isn't available.
Recommended answer
See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:
An incoming HTML or XML entity is always converted into the corresponding Unicode character.
Yes, &nbsp; is turned into a non-breaking space character (U+00A0). If you really want those to be ordinary space characters instead, you'll have to do a Unicode string replace yourself.
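
The replacement described above can be sketched as follows. This is a minimal illustration, not the only way to do it: the sample markup, the NBSP constant, and the choice of the built-in html.parser backend are assumptions made for the example. It shows two variants, one that cleans only the extracted text and one that rewrites the text nodes in the parsed tree.

```python
from bs4 import BeautifulSoup

# U+00A0 is the character that the &nbsp; entity decodes to.
NBSP = "\u00a0"

# Hypothetical sample input for illustration.
html = "<p>Hello&nbsp;world&nbsp;again</p>"
soup = BeautifulSoup(html, "html.parser")

# Variant 1: leave the tree alone and clean only the extracted text.
text = soup.get_text().replace(NBSP, " ")

# Variant 2: rewrite each text node in place so the tree itself changes.
# find_all returns a list, so replacing nodes while looping is safe.
for node in soup.find_all(string=True):
    if NBSP in node:
        node.replace_with(node.replace(NBSP, " "))

print(repr(text))
print(str(soup))
```

Variant 1 is usually enough when only the plain text is needed. Variant 2 is worth the extra loop when the document will be serialized back to HTML afterwards.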