How to remove \xa0 from a string in Python?

Question

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems I'm being left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to remove all of them in Python 2.7, or change them into spaces? I guess the more general question would be: is there a way to remove Unicode formatting?

I tried line = line.replace(u'\xa0',' '), as suggested in another thread, but that changed the \xa0 characters to u's, so now I have "u"s everywhere instead. ):

The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to make it spit out even weirder characters, \xc2 for instance. Can anyone explain this?

Recommended answer

\xa0 is the non-breaking space in Latin1 (ISO 8859-1), which is also chr(160). You should replace it with a space.

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), the unicode string is encoded as UTF-8, in which each code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0, which is where the \xc2 comes from.
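To make the byte counts concrete, here is a minimal sketch. The variable names and the sample string are made up for illustration, and the u'' literals are used only so the snippet reads the same on Python 2.7 and Python 3.

```python
# A single non-breaking space: code point U+00A0 (decimal 160).
nbsp = u'\xa0'

# Encoding to UTF-8 turns the one code point into two bytes: 0xC2 0xA0.
encoded = nbsp.encode('utf-8')
print(len(nbsp), len(encoded))
print(repr(encoded))

# Replacing first leaves an ordinary ASCII space, which UTF-8 stores as one byte,
# so no \xc2 appears in the encoded result.
sample = u'a\xa0b'  # hypothetical sample text
replaced_then_encoded = sample.replace(u'\xa0', u' ').encode('utf-8')
print(repr(replaced_then_encoded))
```

So encoding alone does not remove the non-breaking space; it only changes how that character is stored as bytes.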

Read http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.
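The note does not say which normalization form to use. The sketch below assumes 'NFKC', one of the compatibility forms, which maps the non-breaking space to an ordinary space; the sample text is made up for illustration.

```python
import unicodedata

# Hypothetical sample text standing in for the output of get_text().
text = u'Hello\xa0world'

# Assumption: NFKC (compatibility normalization). It replaces U+00A0 with U+0020.
normalized = unicodedata.normalize('NFKC', text)
print(repr(normalized))

# Caveat: compatibility normalization rewrites other characters as well,
# for example the single-character ligature U+FB01 becomes the two letters "fi".
ligature_expanded = unicodedata.normalize('NFKC', u'\ufb01')
print(repr(ligature_expanded))
```

If only the non-breaking space should change and everything else must stay untouched, the plain replace(u'\xa0', u' ') from the answer is the narrower tool.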
