Why is indexing UTF8 strings discouraged in Julia?

Question

The introductory guide to Julia, Learn Julia in Y Minutes, discourages users from indexing UTF8 strings:

# Some strings can be indexed like an array of characters
"This is a string"[1] # => 'T' # Julia indexes from 1
# However, this will not work well for UTF8 strings,
# so iterating over strings is recommended (map, for loops, etc).

Why is indexing such strings discouraged? What specifically about the structure of this alternate string type makes indexing error-prone? Is this a Julia-specific pitfall, or does it extend to all languages with UTF8 string support?

Recommended answer

Because in UTF8 a character is not always encoded in a single byte.

Take for example the German word böse (evil). The bytes of this string in UTF8 encoding are:

0x62 0xC3 0xB6 0x73 0x65
b    ö         s    e

As you can see, the umlaut ö requires 2 bytes.
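The byte layout above can be checked from Julia itself. This is a minimal sketch, assuming Julia 1.x (the answer was written against release 0.4, where some of these functions had different names); the variable names are only illustrative.

```julia
# Assumes Julia 1.x and a UTF-8 source file with a precomposed "ö" (U+00F6).
s = "böse"

# codeunits exposes the raw UTF-8 bytes of the string.
bytes = collect(codeunits(s))   # 0x62 0xc3 0xb6 0x73 0x65

n_bytes = ncodeunits(s)         # 5 bytes
n_chars = length(s)             # 4 characters
umlaut_bytes = ncodeunits("ö")  # 2 bytes for the umlaut alone

println(bytes)
println(n_bytes, " bytes, ", n_chars, " characters")
```

The gap between `ncodeunits(s)` and `length(s)` is the reason a byte position and a character position can disagree.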

Now if you directly index this UTF8-encoded string, "böse"[4] will give you s and not e, because the index counts bytes rather than characters.
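To make the indexing behaviour concrete, here is a short sketch, again assuming Julia 1.x. In that version, indexing into the middle of a multi-byte character raises a `StringIndexError`; older releases such as 0.4 reported the problem differently.

```julia
# Assumes Julia 1.x.
s = "böse"

# Only byte positions that start a character are valid indices.
valid = collect(eachindex(s))   # [1, 2, 4, 5]

fourth_byte = s[4]              # 's', the character starting at byte 4
fifth_byte = s[5]               # 'e'

# Byte 3 is the second byte of 'ö', so it is not a valid index.
threw = try
    s[3]
    false
catch err
    err isa StringIndexError
end

# nextind steps from one valid index to the next one.
after_umlaut = nextind(s, 2)    # 4

println(valid)
println(threw)
```

So `s[4]` succeeds but returns the third character, while `s[3]` fails outright.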

However, you can use the string as an iterable object in Julia:

julia> for c in "böse"
           println(c)
       end
b
ö
s
e
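If you really need the n-th character rather than a loop, there are a few ways to get it without guessing byte offsets. The following is a sketch assuming Julia 1.x; which option fits best depends on how often you need random access.

```julia
# Assumes Julia 1.x.
s = "böse"

# Option 1: collect the characters into a Vector{Char}, then index normally.
chars = collect(s)
fourth_char = chars[4]          # 'e'

# Option 2: ask for the byte index of the 4th character, then index the string.
idx = nextind(s, 0, 4)          # 5
same_char = s[idx]              # 'e'

# Option 3: take a prefix by character count instead of by byte range.
prefix = first(s, 3)            # "bös"

println(fourth_char, " ", idx, " ", prefix)
```

Option 1 allocates an array of characters up front, while options 2 and 3 walk the string each time they are called.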

And since you asked: no, direct byte indexing issues with UTF8 strings are not specific to Julia.

Recommendation for further reading:
http://docs.julialang.org/en/release-0.4/manual/strings/#unicode-and-utf-8

