
Problem description

How do I write UTF-8 characters to a CSV file?

My data and code:

# -*- coding: utf-8 -*-

l1 = ["žžž", "ččč"]
l2 = ["žžž", "ččč"]

thelist = [l1, l2]

import csv
import codecs

with codecs.open('test', 'w', "utf-8-sig") as f:
   writer = csv.writer(f)
   for x in thelist:
       print x
       for mem in x:
           writer.writerow(mem)

Error message:

Traceback (most recent call last):
   File "2010rudeni priimti.py", line 263, in <module>
writer.writerow(mem)
 File "C:\Python27\lib\codecs.py", line 691, in write
return self.writer.write(data)
 File "C:\Python27\lib\codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
 File "C:\Python27\lib\encodings\utf_8_sig.py", line 82, in encode
return encode(input, errors)
 File "C:\Python27\lib\encodings\utf_8_sig.py", line 15, in encode
return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)

Press any key to continue . . .

What am I doing wrong?

Recommended answer

The csv module in Python 2.x doesn't read or write Unicode; it reads and writes bytes (and assumes they're ASCII-compatible, which is not a problem with UTF-8).

So, when you give it a codecs Unicode file to write to, it passes that file a str rather than a unicode. When codecs tries to encode that str to UTF-8, it first has to decode it to Unicode, and for that it uses your default encoding, ASCII, which fails. Hence this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)
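
The hidden decode step can be reproduced explicitly. The following is a minimal sketch in Python 3 syntax, where the Python 2 str type corresponds to bytes; the variable names are illustrative and not part of the original code. It only mimics what Python 2 did implicitly when asked to encode a byte string.

```python
# A Python 2 `str` holding UTF-8 data is, in Python 3 terms, a bytes object.
utf8_bytes = "ččč".encode("utf-8")

# Python 2 "encoded" a byte string by first decoding it with the default
# codec (ASCII). Doing that step by hand raises the same kind of error.
try:
    utf8_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)

# Decoding with the codec the bytes were actually written in works fine,
# and the resulting text encodes back to the same bytes.
text = utf8_bytes.decode("utf-8")
round_trip = text.encode("utf-8")
```

The failing byte is the first byte of the UTF-8 encoding of č, which is the same 0xc4 byte the traceback complains about.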

The solution is explained in the csv module docs, whose Examples section includes a UnicodeWriter wrapper that takes care of everything for you. Use UnicodeWriter with a plain binary file instead of a codecs file.
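
To show the shape of that pattern (format each row in memory, encode it once, write the bytes to a binary stream), here is a rough sketch in Python 3 syntax. It is not the recipe from the docs: the class name Utf8RowWriter and its parameters are hypothetical, and an in-memory BytesIO stands in for a file opened in binary mode.

```python
import csv
import io


class Utf8RowWriter:
    """Hypothetical helper: formats rows as text, then writes encoded bytes."""

    def __init__(self, stream, encoding="utf-8", **kwds):
        self.buffer = io.StringIO(newline="")   # scratch space for one row
        self.writer = csv.writer(self.buffer, **kwds)
        self.stream = stream                    # must accept bytes
        self.encoding = encoding

    def writerow(self, row):
        # Let csv handle quoting and delimiters on text...
        self.writer.writerow(row)
        data = self.buffer.getvalue()
        # ...then encode explicitly, so no implicit ASCII step is involved.
        self.stream.write(data.encode(self.encoding))
        # Empty the scratch buffer for the next row.
        self.buffer.seek(0)
        self.buffer.truncate(0)


out = io.BytesIO()  # stand-in for a file opened in binary mode
row_writer = Utf8RowWriter(out)
for row in [["žžž", "ččč"], ["žžž", "ččč"]]:
    row_writer.writerow(row)

written = out.getvalue()
```

The point of the design is that encoding happens in exactly one place, under your control, rather than wherever the file object decides to do it.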

As an alternative, there are a few packages on PyPI that wrap the csv module to deal directly in unicode instead of str, such as unicodecsv.

As a more radical alternative, Python 3.x's csv module doesn't have this problem in the first place (and 3.x doesn't have the next problem either).
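
For comparison, a Python 3 version of the question's code might look like the sketch below. It assumes each inner list is meant to be one CSV row, and it writes to a temporary path chosen only for this example rather than to the original 'test' file.

```python
import csv
import os
import tempfile

thelist = [["žžž", "ččč"], ["žžž", "ččč"]]

# Hypothetical output location, used only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "test.csv")

# newline="" lets the csv module control line endings itself;
# utf-8-sig writes a BOM first, as in the original codecs.open call.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerows(thelist)  # one row per inner list

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()
```

In Python 3 every str literal is already Unicode, so no u prefix and no wrapper class is needed.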

A much hackier alternative is to just pretend the entire world is UTF-8. After all, both your source code and your output are intended to be UTF-8, and the csv module doesn't care about anything except a handful of characters (newlines, commas, maybe quotes and backslashes) being ASCII-compatible. So you could skip decoding and encoding entirely, and everything will work. The obvious downside is that if you get anything wrong, you will get a file full of garbage instead of an error to debug.
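
The reason this hack can work is that UTF-8 never uses ASCII byte values inside a multi-byte character, so the bytes csv looks for cannot show up by accident. A small check in Python 3 syntax, with illustrative variable names:

```python
# Byte values of the characters the csv module treats specially.
specials = set(b',"\\\r\n')

# UTF-8 bytes of the non-ASCII characters from the question.
encoded = "žžžččč".encode("utf-8")

# No encoded byte collides with a CSV special character...
overlap = specials.intersection(encoded)

# ...because every byte of a multi-byte UTF-8 sequence is 0x80 or above.
all_high = all(byte >= 0x80 for byte in encoded)

print(overlap, all_high)
```

This holds for UTF-8 specifically; it is not a property of every multi-byte encoding, which is part of why the hack is fragile.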

There are two other problems with your code, neither of which UnicodeWriter or unicodecsv can magically fix (although Python 3 fixes the first).

First, you're not actually giving the csv module Unicode in the first place. The columns in your source data are plain old str literals, like "žžž". You can't encode that to UTF-8; or rather, you can, but only by automatically decoding it as ASCII first, which just causes the same error again. Use Unicode literals, like u"žžž", to avoid this (or, if you prefer, explicitly decode from your source encoding, but that's kind of silly).

Second, if your source has no encoding declaration but uses non-ASCII characters, that is illegal in Python 2.7: per PEP 263, Python 2.5 and later reject such a file with a SyntaxError, while older versions only warned and then treated the source as Latin-1. That fallback is bad, because you're clearly not using a Latin-1 editor (you can't put ž in a Latin-1 text file, because there is no such character). If you save the file as UTF-8 and the bytes are then interpreted as Latin-1, you end up with Å¾Å¾Å¾ instead of žžž, and similar mojibake.
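
The mojibake is easy to reproduce. This sketch, in Python 3 syntax with illustrative variable names, encodes the text as UTF-8 and then misreads the bytes as Latin-1:

```python
text = "žžž"

# What the editor saves to disk: two bytes per character in UTF-8.
utf8_bytes = text.encode("utf-8")

# What you get if those bytes are misread as Latin-1: one character per byte.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as nothing else has touched the text.
repaired = garbled.encode("latin-1").decode("utf-8")

# Latin-1 itself has no ž, so encoding it directly fails.
try:
    text.encode("latin-1")
    latin1_error = None
except UnicodeEncodeError as exc:
    latin1_error = exc

print(garbled, repaired)
```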
