Python UnicodeDecodeError - Am I misunderstanding encode?

Problem description

Any thoughts on why this isn't working? I really thought 'ignore' would do the right thing.

>>> 'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)

Recommended answer

…there's a reason they're called "encodings"…

A little preamble: think of Unicode as the norm, or the ideal state. Unicode is just a table of characters. №65 is Latin capital A. №937 is Greek capital omega. Just that.
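To make the "table of characters" idea concrete, here is a minimal sketch. It assumes Python 3 (the question itself uses Python 2), where `chr` maps a number to its character and the standard `unicodedata` module reports the official name.

```python
import unicodedata

# A code point is just a number in the Unicode table.
capital_a = chr(65)
capital_omega = chr(937)

# Look up the official character names for both entries.
print(capital_a, unicodedata.name(capital_a))
print(capital_omega, unicodedata.name(capital_omega))
```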

In order for a computer to store and/or manipulate Unicode, it has to encode it into bytes. The most straightforward encoding of Unicode is UCS-4: every character occupies 4 bytes, and all of the roughly 1.1 million code points are available. The 4 bytes hold the character's number in the Unicode table as a 4-byte integer. Another very useful encoding is UTF-8, which can encode any Unicode character in one to four bytes. But there are also some limited encodings, like "latin1", which cover only a very limited range of characters, mostly those used by Western European languages. Such encodings use only one byte per character.
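The difference between these encodings can be sketched as follows. This assumes Python 3, and uses the `utf-32-be` codec as a stand-in for UCS-4 (for valid code points the two produce the same 4-byte big-endian integer).

```python
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA, code point 937

# UCS-4 / UTF-32: the code point stored as a 4-byte integer.
ucs4 = omega.encode("utf-32-be")

# UTF-8: variable length, two bytes for this character.
utf8 = omega.encode("utf-8")

# latin-1 has only 256 slots, and omega is not one of them.
try:
    omega.encode("latin-1")
    latin1_can_encode_omega = True
except UnicodeEncodeError:
    latin1_can_encode_omega = False

# A character latin-1 does cover takes exactly one byte.
e_acute = "\u00e9".encode("latin-1")

print(ucs4, utf8, latin1_can_encode_omega, e_acute)
```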

Basically, Unicode can be encoded with many encodings, and encoded strings can be decoded to Unicode. The thing is, Unicode came quite late, so all of us who grew up using an 8-bit character set learned too late that all this time we had been working with encoded strings. The encoding could be ISO 8859-1, or Windows CP437, or CP850, or, or, or, depending on our system default.

So when, in your source code, you enter the string "add “Monitoring“ to list" (and I think you wanted the string "add “Monitoring” to list", note the second quote), you are actually using a string that is already encoded according to your system's default codepage (from the byte \x93 I assume you use Windows codepage 1252, "Western"). If you want to get Unicode from that, you need to decode the string from the "cp1252" encoding.

So, what you meant is:

"add \x93Monitoring\x94 to list".decode("cp1252", "ignore")

It's unfortunate that Python 2.x includes an .encode method for strings too; this is a convenience function for "special" encodings, like the "zip" or "rot13" or "base64" ones, which have nothing to do with Unicode.
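This also explains why the traceback in the question names the 'ascii' codec even though 'latin-1' was requested: when `.encode` is called on a Python 2 byte string, the interpreter first decodes it implicitly with the default codec (ASCII unless reconfigured), and that hidden step is what fails. The sketch below assumes Python 3, which has no `bytes.encode`, so the implicit step is reproduced explicitly; the variable names are illustrative only.

```python
# Python 3 sketch: a bytes literal plays the role of a Python 2 `str`.
raw = b"add \x93Monitoring\x93 to list "

# The step Python 2 performed silently before encoding: decode as ASCII.
implicit_error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    implicit_error = exc

print(implicit_error)

# Explicit decode with the codec the bytes were actually written in.
text = raw.decode("cp1252")

# Only now is there Unicode text to encode. 'ignore' drops the curly
# quotes because latin-1 has no slot for U+201C.
latin1_bytes = text.encode("latin-1", "ignore")
print(latin1_bytes)
```

Note that 'ignore' only applies to the step it is passed to. It could never rescue the original call, because the failure happened in the implicit decode, which always uses strict error handling.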

Anyway, all you have to remember for your to-and-fro Unicode conversions is:

  • a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
  • a Python 2.x string gets decoded to a Unicode string

In both cases, you need to specify the encoding that will be used.
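A small sketch of why the encoding has to be stated on both sides, assuming Python 3 (where `str` is the Unicode string and `bytes` is the encoded form). Decoding with the same codec used for encoding round-trips cleanly; decoding with a different one silently yields garbage.

```python
text = "add \u201cMonitoring\u201d to list"

# Encode, then decode with the SAME codec: the text survives.
round_trips = {}
for encoding in ("utf-8", "cp1252", "utf-16"):
    data = text.encode(encoding)
    round_trips[encoding] = data.decode(encoding) == text

print(round_trips)

# Encode as UTF-8 but decode as latin-1: no error, just wrong characters.
garbled = text.encode("utf-8").decode("latin-1")
print(garbled != text)
```

Here latin-1 is chosen for the mismatched decode only because it maps every byte to some character, so the mistake goes unnoticed instead of raising an exception.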

I'm not being very clear, I'm sleepy, but I sure hope this helps.

PS A humorous side note: Mayans didn't have Unicode; ancient Romans, ancient Greeks, ancient Egyptians didn't either. They all had their own "encodings", and had little to no respect for other cultures. All these civilizations crumbled to dust. Think about it, people! Make your apps Unicode-aware, for the good of mankind. :)

PS2 Please don't spoil the previous message by saying "But the Chinese…". If you feel inclined or obligated to do so, though, delay it by thinking that the Unicode BMP is populated mostly by Chinese ideograms, ergo Chinese is the basis of Unicode. I can go on inventing outrageous lies, as long as people develop Unicode-aware applications. Cheers!
