Problem description
Been banging my head on this for a while and I've read a bunch of articles and the issue isn't any clearer. I have a bunch of strings stored in my database, imagine the following:
x = '\xd0\xa4'
y = '\x92'
At the Python shell I get the following:
print x
Ф
print y
?
Which is exactly what I want to see. However then there is the following:
print unicode(x, 'utf8')
Ф
But not this:
unicode(y, 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: unexpected code byte
My feeling is that our strings are getting mangled because Django tries to convert them to unicode, but I'm just guessing at this point. Any insights or workarounds appreciated.
UPDATE: When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding.
Recommended answer
Looks like you have a typo; it should be x = '\xd0\xa4'. It helps very much if you copy and paste what you actually ran and what appeared in the output.
"\x92" is not a valid UTF-8 string. This explains the exception that you got.
More of a puzzle is why print y produced ?. What are you calling "the Python console"? It appears to be operating in "replace" mode and substituting "?" ... are you sure that it's a plain "?" and not a white "?" inside a black diamond? Why do you say that "?" is exactly what you expect to see?
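Both points can be checked directly. The sketch below uses Python 3 syntax (bytes literals, print() as a function), whereas the thread itself is Python 2, and the helper name is_continuation_byte is made up for illustration. It shows that 0x92 has the bit pattern of a UTF-8 continuation byte, so it cannot start a sequence, and what the decoder's "replace" error handler substitutes for it.

```python
import unicodedata

x = b"\xd0\xa4"  # lead byte 0xD0 followed by continuation byte 0xA4
y = b"\x92"      # a lone byte, as in the question


def is_continuation_byte(value):
    """Return True if the byte has the bit pattern 10xxxxxx.

    Hypothetical helper for illustration only. In UTF-8 such bytes may
    only follow a lead byte; they can never begin a character.
    """
    return (value & 0b11000000) == 0b10000000


decoded_x = x.decode("utf-8")
print(unicodedata.name(decoded_x))   # CYRILLIC CAPITAL LETTER EF
print(is_continuation_byte(y[0]))    # True

# Strict decoding (the default) rejects the lone continuation byte.
try:
    y.decode("utf-8")
    y_error = None
except UnicodeDecodeError as exc:
    y_error = exc
    print("strict decode failed at position", exc.start)

# "replace" mode substitutes U+FFFD, which many fonts draw as a white
# question mark inside a black diamond.
replaced = y.decode("utf-8", errors="replace")
print(ascii(replaced))               # '\ufffd'
print(unicodedata.name(replaced))    # REPLACEMENT CHARACTER

# Encoding to a charset that lacks the character, again in "replace"
# mode, yields a plain ASCII question mark instead.
as_ascii = replaced.encode("ascii", errors="replace")
print(as_ascii)                      # b'?'
```

A terminal can also substitute its own placeholder when it receives raw bytes it cannot render, so a plain "?" need not come from Python at all. That is why the repr() of the string is more trustworthy than what the screen shows.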
UPDATE: You now say """When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding."""
That's not an apostrophe. It seems that that piece of data has been encoded using one of the cp125X (aka windows-125X) encodings. Illustrating using cp1252 (the usual suspect):
IDLE 2.6.4
>>> import unicodedata
>>> uc = '\x92'.decode('cp1252')
>>> print repr(uc)
u'\u2019'
>>> print uc
’
>>> unicodedata.name(uc)
'RIGHT SINGLE QUOTATION MARK'
>>>
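If the stored bytes really are a mix of UTF-8 and cp1252, one possible workaround is to try UTF-8 first and fall back to cp1252 only when that fails. This is a sketch in Python 3 syntax, not something the thread confirms about the asker's data: the function name decode_bytes and the choice of cp1252 as the fallback are assumptions.

```python
def decode_bytes(raw, fallback="cp1252"):
    """Decode raw bytes as UTF-8, falling back to a legacy codec.

    Hypothetical helper: the cp1252 default is a guess about the data
    and should be confirmed against the real database contents.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values (0x81, for example) are undefined in
        # Python's cp1252 codec and would still raise here.
        return raw.decode(fallback)


apostrophe = decode_bytes(b"\x92")
print(ascii(apostrophe))            # '\u2019'

# Once decoded, the text can be stored back as valid UTF-8.
utf8_bytes = apostrophe.encode("utf-8")
print(utf8_bytes)                   # b'\xe2\x80\x99'

# Valid UTF-8 input never reaches the fallback.
cyrillic = decode_bytes(b"\xd0\xa4")
print(ascii(cyrillic))              # '\u0424'
```

Trying UTF-8 first matters because almost any byte sequence decodes "successfully" as cp1252, so reversing the order would silently turn valid UTF-8 text into mojibake.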
Instead of "viewing the contents of the database using a Unicode UTF-8 encoding" (whatever that means), try writing a small snippet of Python code to extract the offending string and then do print repr(bad_string). Show us the code that you ran, plus the output of the repr(). Also tell us which version of Python, what platform (Windows or unix-based), and what version of what database software. And the part of the CREATE TABLE statement relevant to the column in question.
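Such a snippet might look like the following. It is only a sketch: the thread never says which database is involved, so an in-memory SQLite database stands in for it, and the table name notes, the column name body and the sample value are all invented. It is written in Python 3 syntax, so print(repr(...)) replaces the Python 2 print repr(...).

```python
import sqlite3
import sys

# Stand-in database; the real one in the thread is unknown.
conn = sqlite3.connect(":memory:")

# Hand TEXT columns back as raw bytes instead of decoding them, so that
# the stored bytes can be inspected without any codec getting involved.
conn.text_factory = bytes

conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
# Simulate a row that was written with cp1252 bytes (invented data).
conn.execute(
    "INSERT INTO notes (id, body) VALUES (1, CAST(? AS TEXT))",
    (b"It\x92s",),
)

bad_string = conn.execute(
    "SELECT body FROM notes WHERE id = 1"
).fetchone()[0]

# The details asked for above: interpreter version, platform, raw value.
print(sys.version)
print(sys.platform)
print(repr(bad_string))  # b'It\x92s'

conn.close()
```

The repr() output shows the exact bytes that came out of the database, independent of whatever encoding a GUI viewer or terminal applies when displaying them.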