Problem Description
But when saving via StreamWriter:
I've seen this sample (broken link removed):
And it looks like UTF-8 is smaller for some strings, while UTF-16 is smaller for others.
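For example, a quick check with Encoding.GetByteCount shows the pattern (the strings below are my own stand-ins, since the original sample link is broken):

```csharp
using System;
using System.Text;

class EncodingSizeCheck
{
    static void Main()
    {
        // Hypothetical stand-ins for the strings in the lost sample.
        string ascii = "Hello, world";   // pure ASCII
        string cjk = "こんにちは世界";     // Japanese text, all within the BMP

        Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));    // 12 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(ascii)); // 24 bytes (UTF-16)
        Console.WriteLine(Encoding.UTF8.GetByteCount(cjk));      // 21 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(cjk));   // 14 bytes (UTF-16)
    }
}
```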
So why does .NET use UTF-16 as the default encoding for string, but UTF-8 for saving files?
Thank you.
P.S. I've already read the famous article.

Recommended Answer
If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.
Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!
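A minimal sketch of why: UTF-8 needs anywhere from 1 to 4 bytes per code point, which Encoding.UTF8.GetByteCount makes visible:

```csharp
using System;
using System.Text;

class Utf8CodePointWidths
{
    static void Main()
    {
        // UTF-8 uses 1 to 4 bytes per code point, so a fixed-size UTF-8
        // char would have to reserve the worst case: 4 bytes for every one.
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));   // 1 (U+0041)
        Console.WriteLine(Encoding.UTF8.GetByteCount("é"));   // 2 (U+00E9)
        Console.WriteLine(Encoding.UTF8.GetByteCount("€"));   // 3 (U+20AC)
        Console.WriteLine(Encoding.UTF8.GetByteCount("😀"));  // 4 (U+1F600)
    }
}
```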
Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.
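Something close to that hybrid exists in modern .NET: System.Text.Rune (available in .NET Core 3.0 and later) is a UTF-32 scalar value, and Rune.DecodeFromUtf8 walks a UTF-8 buffer one scalar at a time. A minimal sketch:

```csharp
using System;
using System.Text;

class Utf8BackedString
{
    static void Main()
    {
        // Keep the string as UTF-8 bytes, but expose UTF-32 scalar values
        // (System.Text.Rune), converting as we go.
        byte[] utf8 = Encoding.UTF8.GetBytes("héllo 😀");

        ReadOnlySpan<byte> remaining = utf8;
        while (!remaining.IsEmpty)
        {
            Rune.DecodeFromUtf8(remaining, out Rune rune, out int consumed);
            Console.WriteLine($"U+{rune.Value:X4} uses {consumed} UTF-8 byte(s)");
            remaining = remaining.Slice(consumed);
        }
    }
}
```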
The two disadvantages of UTF-16 are:
- The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs.
- For plain ASCII (which a lot of text is, at least in the West) it takes twice the space of the equivalent UTF-8 encoded text. (Both disadvantages are illustrated in the sketch after this list.)
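A minimal sketch demonstrating both disadvantages:

```csharp
using System;
using System.Text;

class Utf16Disadvantages
{
    static void Main()
    {
        // 1) Characters outside the BMP take two UTF-16 code units
        //    (a surrogate pair), so string.Length counts code units,
        //    not user-perceived characters.
        string emoji = "😀"; // U+1F600
        Console.WriteLine(emoji.Length);                    // 2
        Console.WriteLine(char.IsHighSurrogate(emoji[0]));  // True

        // 2) For plain ASCII, UTF-16 takes twice the space of UTF-8.
        string ascii = "plain ASCII text";
        Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));    // 16
        Console.WriteLine(Encoding.Unicode.GetByteCount(ascii)); // 32
    }
}
```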
(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)
Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
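For example, "é" can be a single precomposed code point or an "e" followed by a combining accent; the two strings render identically but have different lengths and compare unequal until normalized:

```csharp
using System;
using System.Text;

class NormalizationIssues
{
    static void Main()
    {
        // The "same" text in two Unicode normalization forms:
        string composed = "caf\u00E9";    // é as a single code point (NFC)
        string decomposed = "cafe\u0301"; // e + combining acute accent (NFD)

        Console.WriteLine(composed == decomposed); // False (ordinal comparison)
        Console.WriteLine(composed.Length);        // 4
        Console.WriteLine(decomposed.Length);      // 5

        // Normalizing both to the same form makes them compare equal.
        Console.WriteLine(
            composed.Normalize(NormalizationForm.FormC) ==
            decomposed.Normalize(NormalizationForm.FormC)); // True
    }
}
```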