Problem description
How do I write UTF-8 characters to a CSV file?
My data and code:
# -*- coding: utf-8 -*-
l1 = ["žžž", "ččč"]
l2 = ["žžž", "ččč"]
thelist = [l1, l2]

import csv
import codecs

with codecs.open('test', 'w', "utf-8-sig") as f:
    writer = csv.writer(f)
    for x in thelist:
        print x
        for mem in x:
            writer.writerow(mem)
Error message:
Traceback (most recent call last):
  File "2010rudeni priimti.py", line 263, in <module>
    writer.writerow(mem)
  File "C:\Python27\lib\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 82, in encode
    return encode(input, errors)
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 15, in encode
    return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)
Press any key to continue . . .
What am I doing wrong?
Recommended answer
The csv module in 2.x doesn't read/write Unicode, it reads/writes bytes (and assumes they're ASCII-compatible, but that's not a problem with UTF-8).
So, when you give it a codecs Unicode file to write to, it passes a str rather than a unicode. And when codecs tries to encode that to UTF-8, it has to first decode it to Unicode, for which it uses your default encoding, which is ASCII, and that fails. Hence this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)
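
A minimal sketch of that failure mode (assuming a UTF-8 source file and the hypothetical file name test.csv): the csv writer hands raw str bytes to the codecs stream, which then has to decode them as ASCII before it can re-encode them.

# -*- coding: utf-8 -*-
# Sketch: reproduce the str-vs-unicode mismatch on Python 2.
import csv
import codecs

with codecs.open('test.csv', 'w', 'utf-8-sig') as f:   # Unicode-aware stream
    writer = csv.writer(f)
    # csv passes this str (raw UTF-8 bytes) straight to f.write(); codecs then
    # tries to decode it with the default ASCII codec -> UnicodeDecodeError
    writer.writerow(["žžž", "ččč"])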
The solution is explained in the docs, with a wrapper class in the Examples section that takes care of everything for you. Use that UnicodeWriter with a plain binary file, instead of a codecs file.
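
For illustration, a minimal sketch of that approach, with the wrapper roughly as it appears in the Python 2.7 csv docs and test.csv as an assumed file name:

# -*- coding: utf-8 -*-
# Sketch: UnicodeWriter (adapted from the csv docs' Examples section)
# writing unicode rows to a plain binary file.
import csv
import codecs
import cStringIO

class UnicodeWriter:
    """Writes rows of unicode strings to a byte stream in the given encoding."""
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.queue = cStringIO.StringIO()              # csv writes UTF-8 bytes here first
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue().decode("utf-8")   # back to unicode...
        self.stream.write(self.encoder.encode(data))   # ...then into the target encoding
        self.queue.truncate(0)                         # empty the queue for the next row

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

thelist = [[u"žžž", u"ččč"], [u"žžž", u"ččč"]]         # unicode literals (see the note below)

with open('test.csv', 'wb') as f:                      # plain binary file, no codecs.open
    writer = UnicodeWriter(f, encoding="utf-8")
    writer.writerows(thelist)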
As an alternative, there are a few different packages on PyPI that wrap up the csv module to deal directly in unicode instead of str, like unicodecsv.
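
A sketch of what that looks like with unicodecsv (installed via pip install unicodecsv; the file name is just an assumption):

# -*- coding: utf-8 -*-
# Sketch: unicodecsv accepts unicode rows and does the encoding itself.
import unicodecsv

thelist = [[u"žžž", u"ččč"], [u"žžž", u"ččč"]]

with open('test.csv', 'wb') as f:                      # plain binary file
    writer = unicodecsv.writer(f, encoding='utf-8')    # unicode in, bytes out
    for row in thelist:
        writer.writerow(row)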
As a more radical alternative, Python 3.x's csv module doesn't have this problem in the first place (and 3.x also doesn't have the next problem).
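
For comparison, a sketch of the same task on Python 3, where the csv module works with text (str) directly and the open() call owns the encoding:

# Sketch: Python 3 handles the encoding at the file level.
import csv

thelist = [["žžž", "ččč"], ["žžž", "ččč"]]

with open('test.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    for row in thelist:
        writer.writerow(row)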
A much hackier alternative is to just pretend the entire world is UTF-8. After all, both your source code and your output are intended to be UTF-8, and the csv module doesn't care about anything beyond a handful of characters (newlines, commas, maybe quotes and backslashes) being ASCII-compatible. So you could just skip decoding and encoding entirely, and everything will work. The obvious downside is that if you get anything wrong, instead of getting an error to debug, you will get a file full of garbage.
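
Roughly, that hack looks like this (assuming the source file really is saved as UTF-8, so the plain str literals already hold UTF-8 bytes):

# -*- coding: utf-8 -*-
# Sketch: treat everything as UTF-8 bytes and never decode or encode.
import csv

thelist = [["žžž", "ččč"], ["žžž", "ččč"]]   # plain str literals: raw UTF-8 bytes

with open('test.csv', 'wb') as f:            # plain binary file, no codec involved
    writer = csv.writer(f)
    for row in thelist:
        writer.writerow(row)                 # bytes in, bytes out, nothing checked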
There are two other problems with your code, neither of which UnicodeWriter or unicodecsv can magically fix (although Python 3 can fix the first).
First, you're not actually giving the csv module Unicode in the first place. The columns in your source data are plain old str literals, like "žžž". You can't encode that to UTF-8; or rather, you can, but only by automatically decoding it as ASCII first, which will just cause the same error again. Use Unicode literals, like u"žžž", to avoid this (or, if you prefer, explicitly decode from your source encoding… but that's kind of silly).
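
As a quick illustration (assuming a UTF-8 source file with a proper coding declaration):

# -*- coding: utf-8 -*-
s = "žžž"                        # str: six raw UTF-8 bytes, what the question passes around
u = u"žžž"                       # unicode: three characters, what the CSV wrappers expect
assert s.decode("utf-8") == u    # the explicit decode from the source encoding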
Second, you haven't specified an encoding declaration in your source, but you've used non-ASCII characters. Technically, this is illegal in Python 2.7. Practically, I'm pretty sure it gives you a warning but then treats your source as Latin-1. Which is bad, because you're clearly not using a Latin-1 editor (you can't put ž in a Latin-1 text file, because there is no such character). If you're saving the file as UTF-8 and then telling Python to interpret it as Latin-1, you're going to end up with mojibake like Å¾Å¾Å¾ instead of žžž.
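
The effect is easy to demonstrate with a hypothetical round trip through the wrong codec:

# -*- coding: utf-8 -*-
# Sketch: what the Latin-1 misreading does to UTF-8 bytes.
raw = u"žžž".encode("utf-8")     # the bytes actually sitting in a UTF-8 source file
print raw.decode("latin-1")      # -> Å¾Å¾Å¾, the mojibake Python 2 would see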