


How to write UTF-8 characters to a CSV file?


# -*- coding: utf-8 -*-

l1 = ["žžž", "ččč"]
l2 = ["žžž", "ččč"]

thelist = [l1, l2]

import csv
import codecs

with codecs.open('test', 'w', "utf-8-sig") as f:
   writer = csv.writer(f)
   for x in thelist:
       print x
       for mem in x:


Traceback (most recent call last):
  File "2010rudeni priimti.py", line 263, in <module>
  File "C:\Python27\lib\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 82, in encode
    return encode(input, errors)
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 15, in encode
    return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)

In Python 2.x, the csv module doesn't read or write Unicode; it reads and writes bytes (and assumes they're ASCII-compatible, which is not a problem with UTF-8).

So when you give it a codecs Unicode file to write to, it passes a str rather than a unicode. When codecs then tries to encode that str to UTF-8, it first has to decode it to unicode, and for that it uses your default encoding. The default is ASCII, so the decode fails. Hence this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)
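
The implicit step can be reproduced by hand. The following is a minimal sketch, not the code that codecs actually runs: it takes the UTF-8 bytes of "ččč" (what the str literal holds when the source file is saved as UTF-8) and decodes them as ASCII explicitly. The names raw and message are made up for the illustration, and the snippet is written for Python 3.

```python
# UTF-8 bytes of "ččč": what a Python 2 str literal holds when the
# source file is saved as UTF-8 (each č is the two bytes 0xC4 0x8D).
raw = "ččč".encode("utf-8")

try:
    # Python 2 performs this decode implicitly, using the default
    # encoding (ASCII), before codecs can encode the result to UTF-8.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The message has the same shape as the one in the traceback; only the byte position differs, because here the string starts with č.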


The solution is explained in the csv module docs, whose Examples section includes a wrapper that takes care of everything for you. Use that UnicodeWriter with a plain binary file instead of a codecs file.

As an alternative, a few packages on PyPI wrap the csv module so that it deals directly in unicode instead of str, such as unicodecsv.

As a more radical alternative, Python 3.x's csv module doesn't have this problem in the first place (and 3.x also doesn't have the next problem).
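
For comparison, here is roughly what the script looks like on Python 3. It is a sketch under a few assumptions: the helper name write_rows is made up for the example, the file is opened in text mode with encoding="utf-8-sig" to keep the BOM from the original code, and newline="" is passed as the csv documentation recommends.

```python
import csv

thelist = [["žžž", "ččč"], ["žžž", "ččč"]]


def write_rows(path, rows):
    """Write rows of str to a UTF-8 CSV file that starts with a BOM."""
    # In Python 3, str is Unicode and the text-mode file does the encoding,
    # so the csv module never has to guess at bytes.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)
```

Calling write_rows("test.csv", thelist) should produce a file that begins with the UTF-8 BOM, followed by two rows of žžž,ččč.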


A much hackier alternative is to just pretend the entire world is UTF-8. After all, both your source code and your output are intended to be UTF-8, and the csv module doesn't care about anything but a handful of characters (newlines, commas, maybe quotes and backslashes) being ASCII-compatible. So you could just skip decoding and encoding entirely, and everything will work. The obvious downside is that if you get anything wrong, instead of getting an error to debug, you will get a file full of garbage.

There are two other problems with your code, neither of which UnicodeWriter or unicodecsv can magically fix (although Python 3 can fix the first).

First, you're not actually giving the csv module Unicode in the first place. The columns in your source data are plain old str literals, like "žžž". You can't encode that to UTF-8—or, rather, you can, but only by automatically decoding it as ascii first, which will just cause the same error again. Use Unicode literals, like u"žžž", to avoid this (or, if you prefer, explicitly decode from your source encoding… but that's kind of silly).
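
The difference between the two kinds of literal can be sketched as follows. This is an illustration written for Python 3, where bytes plays the role of Python 2's str and the u prefix is still accepted on string literals; the variable names are made up.

```python
# What Python 2 stores for the str literal "žžž" in a UTF-8 source file:
# six raw bytes, two per character.
as_bytes = b"\xc5\xbe\xc5\xbe\xc5\xbe"

# What the Unicode literal stores: three characters.
as_text = u"žžž"

# Text can be encoded to UTF-8 directly; no implicit ASCII decode is involved.
assert as_text.encode("utf-8") == as_bytes

# Going the other way needs an explicit decode that names the real encoding.
assert as_bytes.decode("utf-8") == as_text

print(len(as_bytes), len(as_text))  # 6 3
```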

Second, you haven't specified an encoding declaration in your source, but you've used non-ASCII characters. Technically, this is illegal in Python 2.7. Practically, I'm pretty sure it gives you a warning but then treats your source as Latin-1. That is bad, because you're clearly not using a Latin-1 editor (you can't put ž in a Latin-1 text file, because there is no such character). If you're saving the file as UTF-8 and then telling Python to interpret it as Latin-1, you're going to end up with Å¾Å¾Å¾ instead of žžž, and similar mojibake.
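
This kind of mojibake is easy to reproduce. The sketch below (Python 3, with names made up for the example) encodes the text as UTF-8, as the editor does when saving, and then decodes the same bytes as Latin-1, as happens when the source is read under the wrong assumption.

```python
text = "žžž"

# The editor saves the file as UTF-8 ...
saved = text.encode("utf-8")

# ... but the bytes are then interpreted as Latin-1.
mojibake = saved.decode("latin-1")
print(ascii(mojibake))  # '\xc5\xbe\xc5\xbe\xc5\xbe', which displays as Å¾Å¾Å¾

# ž itself has no Latin-1 encoding at all.
try:
    text.encode("latin-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
```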


