Converting from utf-16 to utf-8 in Python 3

Question

I'm programming in Python 3 and I've run into a small problem that I can't find any reference to on the net.

As far as I understand, the default string type is utf-16, but I must work with utf-8, and I can't find the command that converts from the default to utf-8. I'd appreciate your help very much.

Recommended answer

In Python 3, two datatypes matter when you work with strings. The first is the string class (str), an object that represents a sequence of Unicode code points. The important point is that a string is not a run of bytes but a sequence of characters. The second is the bytes class, which is just a sequence of bytes, often representing a string stored in some encoding (like utf-8 or iso-8859-15).
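
As a small illustration of the difference between the two types, here is a minimal sketch. The variable names are made up for this example, and the characters are written as escapes only so that the code points are unambiguous:

```python
# A str holds Unicode code points; a bytes object holds raw byte values.
text = "\u0107 and \u00e7"      # 'ć and ç', written with escapes
data = text.encode("utf-8")     # str -> bytes

print(type(text).__name__, len(text))  # length counted in characters
print(type(data).__name__, len(data))  # length counted in bytes

# Indexing a str gives a one-character str; indexing bytes gives an int.
first_char = text[0]
first_byte = data[0]

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

The two lengths differ because 'ć' and 'ç' each take two bytes in utf-8 while counting as one character each in the string.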

What does this mean for you? As far as I understand, you want to read and write utf-8 files. Let's write a program that replaces every 'ć' with 'ç':

def main():
    # Open the output file first. Passing an encoding tells Python to encode everything we write to it as utf-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Read every line. Passing an encoding to open() makes it return Unicode strings.
        with open('input_file', encoding='utf-8') as in_file:
            for line in in_file:
                # Replace the characters we want. String literals in Python 3 are Unicode strings, so no worries about encoding there.
                # Each line keeps its trailing newline, so we use write() rather than print(); the file object encodes the text to utf-8.
                out_file.write(line.replace('ć', 'ç'))

So when should you use bytes? Not often. One example is reading from a socket. If you have the data in a bytes object, you can turn it into a Unicode string with bytes.decode('encoding'), and go the other way, from string to bytes, with str.encode('encoding'). But as said, you probably won't need it.
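
Applied to the original question, decode followed by encode is also how utf-16 data becomes utf-8 data. The following is only a sketch: the sample bytes are produced in place to stand in for something received from a socket or read from a binary file, and the variable names are invented for the example.

```python
# Stand-in for bytes that arrived from a socket or a file, encoded as utf-16.
utf16_bytes = "\u0107a".encode("utf-16")   # 'ća', with a byte order mark

# bytes -> str: decode using the encoding the data was written in.
text = utf16_bytes.decode("utf-16")

# str -> bytes: encode using the encoding you need.
utf8_bytes = text.encode("utf-8")

print(text == "\u0107a")
print(utf8_bytes)
```

The string in the middle has no encoding of its own; the encoding only matters at the two points where bytes are involved.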

Still, because it is interesting, here is the hard way, where you do all the encoding and decoding yourself:

def main():
    # Open the output file in binary mode, so we write bytes to it instead of strings.
    with open('output_file', 'wb') as out_file:
        # Read every line. Again we open the file in binary mode, so we get bytes.
        with open('input_file', 'rb') as in_file:
            for line_bytes in in_file:
                # Convert the bytes to a string.
                line_string = line_bytes.decode('utf-8')
                # Replace the characters we want.
                line_string = line_string.replace('ć', 'ç')
                # Convert the string back to bytes.
                out_bytes = line_string.encode('utf-8')
                # Write the bytes. print() would write text, so we call write() on the binary file.
                out_file.write(out_bytes)

A good read on this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Highly recommended!

Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(P.S. As you can see, I didn't mention utf-16 in this post. I actually don't know whether Python uses it as its internal encoding, but that is totally irrelevant. As long as you are working with a string, you work with characters (code points), not bytes.)
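
To make that last point concrete, here is a short sketch (names invented for the example) showing that the length of a string stays the same while the number of bytes depends on which encoding you pick when you leave the string world:

```python
text = "a\u0107"  # 'a' followed by 'ć': two code points

# Byte count of the same string under three encodings (no byte order mark).
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(len(text))
print(sizes)
```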


08-27 17:03