Problem description
I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but I am left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to remove all of them in Python 2.7 and change them into spaces? I guess the more general question is: is there a way to remove Unicode formatting?
I tried using line = line.replace(u'\xa0', ' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):
The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
Recommended answer
\xa0 is the non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a regular space.
string = string.replace(u'\xa0', u' ')
When you call .encode('utf-8'), the unicode string is encoded to UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0, which is where the \xc2 comes from.
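The encoding behaviour described above can be checked with a short sketch. The sample string is made up for illustration, and the snippet is written so that it should run on both Python 2.7 and Python 3; it is only a demonstration, not the asker's actual code.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text containing a non-breaking space (U+00A0).
text = u"price:\xa0100"

# Encoding without replacing: U+00A0 becomes the two bytes 0xC2 0xA0.
encoded = text.encode("utf-8")
nbsp_bytes = u"\xa0".encode("utf-8")

# Replacing first: the non-breaking space becomes a plain ASCII space,
# so the encoded result contains no 0xC2 byte at all.
cleaned = text.replace(u"\xa0", u" ")
cleaned_bytes = cleaned.encode("utf-8")

print(repr(encoded))
print(repr(cleaned_bytes))
```

Printing with repr() makes the raw bytes visible, which is probably what produced the "weird" \xc2 in the question.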
Read http://docs.python.org/howto/unicode.html.
Please note: this answer is from 2012. Python has moved on, and you should now be able to use unicodedata.normalize.
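A minimal sketch of the unicodedata.normalize approach follows. It assumes the NFKC form is acceptable for your text; the answer does not say which form to use. The helper name normalize_spaces is hypothetical. Be aware that NFKC rewrites every compatibility character, not only \xa0, so ligatures and full-width characters are changed as well.

```python
import unicodedata


def normalize_spaces(text):
    """Hypothetical helper: apply NFKC compatibility normalization.

    U+00A0 has a compatibility mapping to U+0020, so it becomes an
    ordinary space. Other compatibility characters are rewritten too.
    """
    return unicodedata.normalize("NFKC", text)


sample = u"Hello\xa0World"
result = normalize_spaces(sample)
print(repr(result))
```

If only the non-breaking space should change and everything else must stay untouched, the plain replace() call from the answer is the narrower tool.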