Question
The introductory guide to Julia, Learn Julia in Y Minutes, discourages users from indexing UTF-8 strings:
# Some strings can be indexed like an array of characters
"This is a string"[1] # => 'T' # Julia indexes from 1
# However, this is will not work well for UTF8 strings,
# so iterating over strings is recommended (map, for loops, etc).
Why is indexing such strings discouraged? What specifically about the structure of this alternate string type makes indexing error-prone? Is this a Julia-specific pitfall, or does it extend to all languages with UTF-8 string support?
Recommended answer
Because in UTF-8 a character is not always encoded in a single byte.
Take, for example, the German word böse (evil). The bytes of this string in UTF-8 encoding are:
0x62  0xC3 0xB6  0x73  0x65
b     ö          s     e
As you can see, the umlaut ö requires 2 bytes.
Now, if you index this UTF-8 encoded string directly, "böse"[4] gives you s and not e, because the index counts bytes rather than characters.
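The byte layout above can be inspected from Julia itself. The following is a minimal sketch assuming Julia 1.x (the answer was written against release 0.4, where some of these functions had different names); the variable names are illustrative only.

```julia
# Sketch for Julia 1.x; all variable names here are illustrative.
s = "b\u00f6se"                   # "böse", with ö as the single code point U+00F6

bytes  = collect(codeunits(s))    # raw UTF-8 bytes: 0x62 0xc3 0xb6 0x73 0x65
nbytes = ncodeunits(s)            # 5 bytes ...
nchars = length(s)                # ... but only 4 characters

valid_indices  = collect(eachindex(s))  # [1, 2, 4, 5]: byte offsets where a character starts
char_at_byte_4 = s[4]                   # 's', the character that starts at byte 4
byte_3_valid   = isvalid(s, 3)          # false: byte 3 is the second byte of ö

println(valid_indices)
println(char_at_byte_4)
```

Index 3 falls in the middle of ö, so it is not a valid character index; in Julia 1.x, evaluating s[3] throws an error rather than returning a character.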
However, you can use the string as an iterable object in Julia:
julia> for c in "böse"
           println(c)
       end
b
ö
s
e
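If you really need the n-th character rather than a loop, there are a few ways to get it without guessing byte offsets. This is a sketch assuming Julia 1.x; the variable names are made up for illustration, and which option is preferable depends on how often you need random access.

```julia
# Sketch for Julia 1.x; all variable names here are illustrative.
s = "b\u00f6se"                           # "böse"

chars    = collect(s)                     # Vector{Char}, indexable by character position
fourth_a = chars[4]                       # 'e'

fourth_b = first(Iterators.drop(s, 3))    # skip 3 characters lazily, take the next one

idx      = nextind(s, 0, 4)               # byte index of the 4th character (5 here)
fourth_c = s[idx]                         # 'e'

# pairs(s) yields each valid byte index together with its character
positions = [(i, c) for (i, c) in pairs(s)]

println(fourth_a)
println(positions)
```

collect copies the whole string into an array, which is convenient for repeated access; nextind avoids the copy but still has to walk the string from the start.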
And since you asked: no, direct byte-indexing issues with UTF-8 strings are not specific to Julia.
Recommended further reading:
http://docs.julialang.org/en/release-0.4/manual/strings/#unicode-and-utf-8