Here is how I open, read, and print the file. It is a UTF-8 encoded file containing Unicode characters. I want to print the first 10 characters, but the code snippet below prints 10 weird, unrecognized characters instead. Does anyone have any ideas on how to print them correctly? Thanks.

    with open(name, 'r') as content_file:
        content = content_file.read()
        for i in range(10):
            print content[i]


Each of the 10 weird characters looks like this:




When Unicode codepoints (characters) are encoded as UTF-8, some codepoints are converted to a single byte, but many codepoints become more than one byte. Characters in the standard 7-bit ASCII range are encoded as single bytes; every other character takes between two and four bytes.
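
To make the byte counts concrete, here is a minimal Python 3 sketch that encodes a few codepoints and reports how many bytes each one takes. The sample characters and the helper name `utf8_width` are my own choices for illustration, not something from the question.

```python
# Python 3: count the UTF-8 bytes needed for a few sample characters.
# The samples are illustrative: A, the copyright sign, the trademark
# sign, and an emoji from outside the Basic Multilingual Plane.
samples = ["A", "\u00a9", "\u2122", "\U0001f600"]

def utf8_width(ch):
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

widths = [utf8_width(ch) for ch in samples]
for ch, w in zip(samples, widths):
    # Print the codepoint number rather than the character itself, so the
    # output does not depend on the terminal's encoding.
    print("U+%04X -> %d byte(s)" % (ord(ch), w))
```

Only the first sample is in the ASCII range, so only it fits in a single byte.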


So you are getting those weird characters because you are breaking up those multi-byte UTF-8 sequences into single bytes. Sometimes those bytes will correspond to normal printable characters, but often they won't, so you get � printed instead.


Here's a short demo using the ©, ®, and ™ characters, which are encoded as 2, 2, and 3 bytes respectively in UTF-8. My terminal is set to use UTF-8.

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
print utfbytes, len(utfbytes)
for b in utfbytes:
    print b, repr(b)

uni = utfbytes.decode('utf-8')
print uni, len(uni)


© ® ™ 9
� '\xc2'
� '\xa9'
  ' '
� '\xc2'
� '\xae'
  ' '
� '\xe2'
� '\x84'
� '\xa2'
© ® ™ 5
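
For comparison, here is a rough Python 3 take on the same demo. In Python 3, iterating over a bytes object yields integers rather than one-character strings, so this sketch slices out one byte at a time instead. Decoding each lone byte with `errors="replace"` is my stand-in for what a UTF-8 terminal does with a broken sequence: it substitutes U+FFFD, the replacement character that is displayed as �.

```python
# Python 3 sketch of the demo above.
utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"

# Decoding one byte at a time breaks up the multi-byte sequences; with
# errors="replace", each undecodable byte becomes U+FFFD.
pieces = [utfbytes[i:i + 1].decode("utf-8", errors="replace")
          for i in range(len(utfbytes))]
print(len(utfbytes), [ascii(p) for p in pieces])

# Decoding the whole byte string at once gives the intended 5 characters.
uni = utfbytes.decode("utf-8")
print(len(uni), ascii(uni))
```

`ascii()` is used for printing so the result is readable even on a terminal that is not set to UTF-8.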

Stack Overflow co-founder, Joel Spolsky, has written a good article on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

You should also take a look at the Unicode HOWTO article in the Python docs, and Ned Batchelder's Pragmatic Unicode article, aka "Unipain".


Here's a short example of extracting individual characters from a UTF-8 encoded byte string. As I mention in the comments, to do this correctly you need to know how many bytes each of the characters is encoded as.

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    print "%d %d [%s]" % (start, w, utfbytes[start:start+w])
    start += w


0 2 [©]
2 1 [ ]
3 2 [®]
5 1 [ ]
6 3 [™]

FWIW, here's a Python 3 version of that code:

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    s = utfbytes[start:start+w]
    print("%d %d [%s]" % (start, w, s.decode()))
    start += w

If we don't know the byte widths of the characters in our UTF-8 string then we need to do a little more work. Each UTF-8 sequence encodes the width of the sequence in the first byte, as described in the Wikipedia article on UTF-8.

The following Python 2 demo shows how you can extract that width information; it produces the same output as the two previous snippets.

# UTF-8 code widths
#width starting byte
#1 0xxxxxxx
#2 110xxxxx
#3 1110xxxx
#4 11110xxx
#C 10xxxxxx

def get_width(b):
    if b <= '\x7f':
        return 1
    elif '\x80' <= b <= '\xbf':
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif '\xc0' <= b <= '\xdf':
        return 2
    elif '\xe0' <= b <= '\xef':
        return 3
    elif '\xf0' <= b <= '\xf7':
        return 4
    else:
        raise ValueError('%r is not a single byte' % b)

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
start = 0
while start < len(utfbytes):
    b = utfbytes[start]
    w = get_width(b)
    s = utfbytes[start:start+w]
    print "%d %d [%s]" % (start, w, s)
    start += w


Generally, it should not be necessary to do this sort of thing: just use the provided decoding methods.
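
Applied to the original question, that could look like the sketch below. It assumes the file really is UTF-8 and uses `io.open`, which accepts an `encoding` argument in both Python 2.6+ and Python 3 (in Python 3 the built-in `open` is the same function). The helper name `first_chars`, the sample text, and the temporary file are made up for the demonstration.

```python
import io
import os
import tempfile

def first_chars(path, n=10):
    """Return the first n characters (not bytes) of a UTF-8 text file."""
    with io.open(path, "r", encoding="utf-8") as content_file:
        # In text mode, read(n) counts decoded characters, not bytes.
        return content_file.read(n)

# Demonstration with a throwaway file containing some non-ASCII text.
text = u"\u00a9 \u00ae \u2122 caf\u00e9 and more text"
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)
    head = first_chars(path)
finally:
    os.remove(path)

print(len(head))
```

Because the file object does the decoding, slicing or indexing the result works on whole characters and never lands in the middle of a multi-byte sequence.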

For the curious, here's a Python 3 version of get_width, and a function that decodes a UTF-8 bytestring manually.

def get_width(b):
    if b <= 0x7f:
        return 1
    elif 0x80 <= b <= 0xbf:
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif 0xc0 <= b <= 0xdf:
        return 2
    elif 0xe0 <= b <= 0xef:
        return 3
    elif 0xf0 <= b <= 0xf7:
        return 4
    else:
        raise ValueError('%r is not a single byte' % b)

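def decode_utf8(utfbytes):
    start = 0
    uni = []
    while start < len(utfbytes):
        b = utfbytes[start]
        w = get_width(b)
        if w == 1:
            # Single-byte (ASCII) sequence: the byte is the codepoint
            n = b
        else:
            # Keep only the payload bits of the leading byte
            n = b & (0x7f >> w)
            # Each continuation byte contributes 6 more bits
            for b in utfbytes[start+1:start+w]:
                if not 0x80 <= b <= 0xbf:
                    raise ValueError('Not a continuation byte: %r' % b)
                n <<= 6
                n |= b & 0x3f
        uni.append(chr(n))
        start += w
    return ''.join(uni)

utfbytes = b'\xc2\xa9 \xc2\xae \xe2\x84\xa2'
print(utfbytes.decode('utf8'))
print(decode_utf8(utfbytes))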



© ® ™
© ® ™

