I'm having trouble storing and outputting an ndash character as UTF-8 in Django.
I'm getting data from an API. In raw form, as retrieved and viewed in a text editor, a given unit of data may look like this:
"I love this detergent \u2013 it is so inspiring."
(\u2013 is &ndash; as an HTML entity.)
If I get this straight from the API and display it in Django, no problem. It displays in my browser as a long dash. I noticed, though, that I have to call decode('utf-8') to avoid the "'ascii' codec can't encode character" error if I try to do some operations on that text in my view. According to the Django Debug Toolbar, the text is going to the template as "I love this detergent\u2013 it is so inspiring.".
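The error described above comes from mixing byte strings and Unicode text. Here is a minimal sketch of the distinction, written in Python 3 syntax for clarity (the question uses Python 2.6, where the same step is decode('utf-8') on a str). The variable names and the sample bytes are illustrative assumptions:

```python
# Bytes as they might arrive from the API: UTF-8 encoded text (assumed sample).
raw = b"I love this detergent \xe2\x80\x93 it is so inspiring."

# Decode once, at the boundary, to get Unicode text.
text = raw.decode("utf-8")

# Forcing the text through the ASCII codec fails, because
# U+2013 (en dash) has no ASCII representation.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Decoding exactly once on input and encoding exactly once on output is the usual way to avoid this class of error.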
When stored to MySQL and read for output through the same view and template, however, it ends up looking like
"I love this detergent â€" it is so inspiring"
My MySQL table is set to DEFAULT CHARSET=utf8.
Now, when I read the data from the database through the MySQL monitor in a terminal set to UTF-8, it shows up as
"I love this detergent – it is so inspiring"
(correct - shows an ndash)
When I use MySQLdb in a Python shell, this line is
"I love this detergent \xe2\x80\x93 it is so inspiring"
(this is the correct UTF-8 for an ndash)
However, if I run python manage.py shell, and then
In [1]: from myproject.myapp.models import ThatTable
In [2]: msg=ThatTable.objects.all().filter(thefield__contains='detergent')
In [3]: msg
Out[3]: [{'thefield': 'I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring'}]
It appears to me that Django has taken \xe2\x80\x93 to mean three separate characters, and encoded each of them as UTF-8, giving \xc3\xa2\xe2\x82\xac\xe2\x80\x9c. This displays as â€" because \xe2 appears to be â, \x80 appears to be €, etc. I've checked, and this is how it's being sent to the template as well.
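The byte sequences quoted above can be reproduced by hand. The sketch below (Python 3 syntax) assumes the misreading step uses the cp1252 code page, which is what MySQL's latin1 is based on. It shows the mechanism only, and does not establish which layer in this setup performs the second encoding:

```python
en_dash = "\u2013"

# Correct UTF-8 bytes for the en dash.
utf8_bytes = en_dash.encode("utf-8")

# Misread those three bytes as three separate cp1252 characters...
misread = utf8_bytes.decode("cp1252")

# ...and encode the result as UTF-8 a second time.
double_encoded = misread.encode("utf-8")

print(repr(utf8_bytes))
print(repr(misread))
print(repr(double_encoded))
```

Note that Python's own latin-1 codec maps byte 0x80 to a control character rather than to €, so it would not reproduce the sequence seen here; cp1252 does.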
If you decode the long sequence in Python with decode('utf-8'), though, the result is \xe2\u20ac\u201c, which also renders in the browser as â€". Trying to decode it again yields a UnicodeDecodeError.
I've followed the Django suggestions for Unicode, as far as I know (configured MySQL).
Any suggestions on what I may have misconfigured?
Addendum: It seems this same issue has cropped up in other areas or systems as well. While searching for \xc3\xa2\xe2\x82\xac\xe2\x80\x9c, I found at http://pastie.org/908443.txt a script to "repair bad UTF8 entities", also found in a WordPress RSS import plugin. It simply replaces this sequence with &ndash;. I'd like to solve this the right way, though!
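For rows that have already been stored double-encoded, a more general alternative to replacing individual sequences is to reverse the faulty round trip. This is a rough sketch in Python 3 syntax. The helper name repair_double_encoding is hypothetical, and it assumes the damage came from UTF-8 bytes being read as cp1252:

```python
def repair_double_encoding(text):
    """Reverse one round of UTF-8-read-as-cp1252 mojibake (hypothetical helper)."""
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 afterwards:
        # the text was probably not double-encoded, so leave it alone.
        return text


broken = "I love this detergent \xe2\u20ac\u201c it is so inspiring"
fixed = repair_double_encoding(broken)
```

This only cleans up existing data. It does not fix whatever caused the double encoding. Characters whose UTF-8 form contains a byte that cp1252 leaves undefined (such as 0x81 or 0x9d) will not round-trip this way and are returned unchanged.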
Oh, and I'm using Django 1.2 and Python 2.6.5.
I can connect to the same database with PHP/PDO and print out this data without doing anything special, and it looks fine.
This does seem like a case of double-encoding; I don't have much experience with Python, but try adjusting the MySQL connection settings as per the advice at http://tahpot.blogspot.com/2005/06/mysql-and-python-and-unicode.html
What I'm guessing is happening is that the connection is latin1, so MySQL tries to encode the string again before storage to the UTF-8 field. The code there, specifically this bit:
is probably what you want.
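The snippet from the linked post is not reproduced here. As a rough illustration of what a connection-level character set setting can look like in Django, here is a hypothetical settings fragment. The database name and credentials are placeholders, and the idea that the OPTIONS entries are what this particular setup is missing is an assumption:

```python
# Hypothetical Django settings fragment; database name and credentials
# are placeholders, not values from the question.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mydb",
        "USER": "myuser",
        "PASSWORD": "secret",
        "OPTIONS": {
            # Passed through to MySQLdb.connect(): ask for a UTF-8
            # connection and Unicode results.
            "charset": "utf8",
            "use_unicode": True,
        },
    }
}
```

With plain MySQLdb, the same two keyword arguments (charset and use_unicode) can be passed directly to MySQLdb.connect().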