Python Unicode encoding

Problem description

I am using argparse to read the arguments for my Python code. One of those inputs is the title of a file [title], which can contain Unicode characters. I have been using 22少女時代22 as a test string.

I need to write the value of the input title to a file, but when I try to convert the string to UTF-8 it always throws an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)

From what I have read, my string needs to be a unicode object, of the form u"foo", before I can call .encode() on it.

When I run type() on my input from argparse I see:

<type 'str'>

I am looking to get a response of:

<type 'unicode'>

How can I get it in the right form?

Idea:

Modify argparse to take in a str but store it as a unicode string u"foo":

parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')

This approach is not working at all. Thoughts?

Edit 1:

Some sample code where title is 22少女時代22:

inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title

Solution

It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:

>>> '22少女時代22'.encode('sjis')
b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'

(at a Python 3 prompt)
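To make "position 2" concrete, here is a small Python 3 sketch that lists the first few byte values of the encoded test string. The variable names are only illustrative, not part of the original question.

```python
# Python 3 sketch: look at the Shift JIS bytes of the test string.
title = "22少女時代22"
sjis_bytes = title.encode("sjis")

# Show the byte value at each of the first four positions.
# Positions 0 and 1 are the ASCII digits "22"; position 2 starts the first kanji.
for position, byte in enumerate(sjis_bytes[:4]):
    print(position, hex(byte))

# Index of the first byte that falls outside the ASCII range (>= 0x80).
first_non_ascii = next(i for i, b in enumerate(sjis_bytes) if b >= 0x80)
```

The first non-ASCII byte is exactly where an ASCII decoder has to give up, which matches the position reported in the error message.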

Now, to "convert the string to UTF-8", you used something like

title.encode('utf8')

The problem is that title is actually a bytestring containing the SJIS-encoded string. Due to a design flaw in Python 2, a bytestring (str) also has an .encode() method; calling it first implicitly decodes the bytes with the default codec, which is ASCII. So what you have is conceptually equivalent to

title.decode('ascii').encode('utf8')

and of course the decode call fails.
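The failure can be reproduced on Python 3 by spelling out the implicit step by hand. This is only a sketch: implicit_ascii_encode is a hypothetical helper that imitates what Python 2 does behind the scenes, and raw_title stands in for the bytes argparse handed over.

```python
# Bytes as they would arrive from an SJIS/cp932 console (taken from the output above).
raw_title = b"22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322"


def implicit_ascii_encode(raw, target="utf-8"):
    """Imitate Python 2's str.encode(): decode as ASCII first, then encode."""
    return raw.decode("ascii").encode(target)


try:
    implicit_ascii_encode(raw_title)
    error = None
except UnicodeDecodeError as exc:
    # The decode step fails at the first byte outside the ASCII range.
    error = exc
    print(exc)
```

Pure ASCII input passes through this path unchanged, which is why the problem tends to stay hidden until non-ASCII data shows up.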

You should instead explicitly decode from SJIS to a Unicode string, before encoding to UTF-8:

title.decode('sjis').encode('utf8')
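As a Python 3 sketch of the same two-step conversion, the helper below decodes with an explicitly named source encoding and then encodes to the target. The name transcode and its parameters are assumptions made for illustration.

```python
def transcode(raw, source_encoding, target_encoding="utf-8"):
    """Decode bytes with a known source encoding, then re-encode them."""
    text = raw.decode(source_encoding)  # bytes -> Unicode text
    return text.encode(target_encoding)  # Unicode text -> bytes


sjis_title = b"22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322"
utf8_title = transcode(sjis_title, "sjis")
```

The important part is that the source encoding is stated explicitly instead of being left to a default.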

As Mark Tolonen pointed out, you're probably typing the characters into your console, and your console uses a non-Unicode encoding.

So it turns out your sys.stdin.encoding is cp932, which is Microsoft's variant of SJIS. For this, use

title.decode('cp932').encode('utf8')
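Rather than hard-coding cp932, the decoding could be tied to whatever the console reports and plugged into argparse through its type= parameter. The sketch below is written for Python 3, where command-line arguments normally arrive as text already, so the bytes branch mainly illustrates the Python 2 situation. console_encoding, to_text and build_parser are hypothetical names, and the UTF-8 fallback is an assumption.

```python
import argparse
import sys


def console_encoding(default="utf-8"):
    """Return the encoding reported for stdin, or a fallback if unavailable."""
    return getattr(sys.stdin, "encoding", None) or default


def to_text(value, encoding=None):
    """Return Unicode text; decode only if the value is still raw bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding or console_encoding())
    return value


def build_parser():
    parser = argparse.ArgumentParser()
    # argparse calls to_text on the raw argument before storing it.
    parser.add_argument("title", metavar="T", type=to_text,
                        help="title, stored as Unicode text")
    return parser
```

With this in place, build_parser().parse_args([...]).title is always text, and the later .encode('utf8') call no longer goes through an implicit ASCII decode.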

You really should set your console encoding to the standard UTF-8, but I'm not sure if that's possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.
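Whichever console encoding is in effect, once the title is a Unicode string the file itself can be opened with an explicit encoding, so no manual .encode() call is needed at write time. A minimal sketch using io.open, which accepts an encoding argument on both Python 2 and 3; write_title, demo and the file name are made up for illustration.

```python
import io
import os
import tempfile


def write_title(path, title):
    """Write Unicode text to path, letting the file object encode it as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(title)


def demo():
    """Write the test title to a temporary file and return the bytes on disk."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "title.txt")
    write_title(path, u"22少女時代22")
    with io.open(path, "rb") as handle:
        return handle.read()
```

Reading the file back in binary mode, as demo does, is a quick way to confirm that the bytes on disk are UTF-8 rather than the console's encoding.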
