At all times, text encoded in UTF-8 will never give us more than a +50% file size of the same text encoded in UTF-16. True or false?

Question

Somewhere I read (rephrased):

If we compare a UTF-8 encoded file with a UTF-16 encoded file, there are times when the UTF-8 file may be 50% to 100% larger.

Am I right to say that the article is wrong because at all times, text encoded in UTF-8 will never give us more than a +50% file size of the same text encoded in UTF-16?

Recommended answer

The answer is that in UTF-8, ASCII is just 1 byte, but that in general, most Western languages including English use a few characters here and there that require 2 bytes, so actual percentages vary. The Greek and Cyrillic languages all require at least 2 bytes per character in their script when encoded in UTF-8.

Common Eastern languages require for their characters 3 bytes in UTF-8 but 2 in UTF-16. Note however that "uncommon" Eastern characters require 4 bytes in both UTF-8 and UTF-16 alike.

3 is indeed only 50% greater than 2. But that is for a single code point only. It does not apply to an entire file.
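
The per-code-point sizes described above can be checked directly. The following is a minimal Python sketch, not part of the original answer; the sample characters and the helper name `bytes_per_code_point` are chosen here purely for illustration, and the `utf-16-le` codec is used so that no byte order mark is counted.

```python
# Sample code points from each range discussed above (illustrative choices).
SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "Latin letter with accent",
    "\u044f": "Cyrillic letter",
    "\u6771": "common CJK ideograph",
    "\U0002000B": "CJK ideograph outside the BMP",
}

def bytes_per_code_point(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) for a single code point, BOM excluded."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch, label in SAMPLES.items():
    u8, u16 = bytes_per_code_point(ch)
    print(f"U+{ord(ch):04X} {label}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Only the common CJK ideograph shows the 3-versus-2 ratio; every other sample is the same size or smaller in UTF-8.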

The actual percentage is impossible to state with precision, because you do not know whether the balance of code points falls in the 1- or 2-byte UTF-8 range, or in the 4-byte UTF-8 range. If there is white space in the Asian text, then that is only one byte of UTF-8, and yet it is a costly 2 bytes of UTF-16.

These things do vary. You can only get precise numbers on precise text, not on general text. Code points in Asian text take 1, 2, 3, or 4 bytes of UTF-8, while in UTF-16 they variously require 2 or 4 bytes apiece.
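
To make the effect of mixed content concrete, here is a small Python sketch under stated assumptions: the two strings are artificial examples invented for illustration, and `utf8_to_utf16_ratio` is a hypothetical helper that ignores the byte order mark.

```python
def utf8_to_utf16_ratio(text):
    """UTF-8 size divided by UTF-16 size (BOM excluded) for a whole string."""
    return len(text.encode("utf-8")) / len(text.encode("utf-16-le"))

cjk_only = "\u6771\u4eac" * 10      # 20 ideographs: 3 bytes each in UTF-8, 2 in UTF-16
with_ascii = cjk_only + " " * 20    # the same text plus 20 ASCII spaces

print(utf8_to_utf16_ratio(cjk_only))    # 1.5
print(utf8_to_utf16_ratio(with_ascii))  # 1.0
```

The +50% figure holds only for the string made purely of 3-byte code points; adding an equal number of ASCII characters brings the two encodings to the same size.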

Compare the various languages’ Wikipedia pages on Tokyo to see what I mean. Even in Eastern languages, there is still plenty of ASCII going on. This alone makes your figures fluctuate. Consider:

Paras Lines Words Graphs Chars  UTF16 UTF8   8:16 16:8  Language

 519  1525  6300  43120 43147  86296 44023   51% 196%  English
 343   728  1202   8623  8650  17302  9173   53% 189%  Welsh
 541  1722  9013  57377 57404 114810 59345   52% 193%  Spanish
 529  1712  9690  63871 63898 127798 67016   52% 191%  French
 321   837  2442  18999 19026  38054 21148   56% 180%  Hungarian

 202   464   976   7140  7167  14336 11848   83% 121%  Greek
 348   937  2938  21439 21467  42936 36585   85% 117%  Russian

 355   788   613   6439  6466  12934 13754  106%  94%  Chinese, simplified
 209   419   243   2163  2190   4382  3331   76% 132%  Chinese, traditional
 461  1127  1030  25341 25368  50738 65636  129%  77%  Japanese
 410   925  2955  13942 13969  27940 29561  106%  95%  Korean

Each of those is the Tokyo Wikipedia page saved as text, not as HTML. All text is in NFC, not in NFD. The meaning of each of the columns is as follows:

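  1. Paras is the number of blank-line-separated text spans.
  2. Lines is the number of line-break-separated text spans.
  3. Words is the number of whitespace-separated text spans.
  4. Graphs is the number of Unicode extended grapheme clusters, sometimes called glyphs. These are user-visible characters.
  5. Chars is the number of Unicode code points. These are, or should be, programmer-visible characters.
  6. UTF16 is how many bytes the file takes up when stored as UTF-16.
  7. UTF8 is how many bytes the file takes up when stored as UTF-8.
  8. 8:16 is the ratio of UTF-8 size to UTF-16 size, expressed as a percentage.
  9. 16:8 is the ratio of UTF-16 size to UTF-8 size, expressed as a percentage.
  10. Language is which version of the Tokyo page we are talking about here.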
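
Most of these columns can be approximated with the Python standard library. The sketch below is an assumption about how such counts could be produced, not the tool the answer's author used; `text_stats` is a hypothetical name, and the Graphs column is left out because counting extended grapheme clusters needs a third-party library. The `utf-16` codec is used on the assumption that the UTF16 column includes a 2-byte byte order mark, which matches the table (each UTF16 value is twice Chars plus 2).

```python
import re
import unicodedata

def text_stats(text):
    """Compute most of the table's columns for one string (Graphs is omitted)."""
    text = unicodedata.normalize("NFC", text)   # the table's texts are in NFC
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16"))          # this codec prepends a 2-byte BOM
    return {
        "Paras": len([p for p in re.split(r"\n\s*\n", text) if p.strip()]),
        "Lines": len(text.splitlines()),
        "Words": len(text.split()),
        "Chars": len(text),
        "UTF16": utf16,
        "UTF8": utf8,
        "8:16": round(100 * utf8 / utf16),
        "16:8": round(100 * utf16 / utf8),
    }

sample = "Tokyo\n\u6771\u4eac \u90fd\n\nJapan\n"
print(text_stats(sample))
```

Exact counts for Paras, Lines, and Words depend on how blank lines and trailing newlines are treated, so they may differ slightly from another tool's output on the same file.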

I’ve grouped the languages into Western Latin, Western non-Latin, and Eastern. Observations:

  1. Western languages that use the Latin script suffer terribly upon conversion from UTF-8 to UTF-16, with English suffering the most by expanding by 96% and Hungarian the least by expanding by 80%. All are huge.

  2. Western languages that do not use the Latin script still suffer upon conversion to UTF-16, but only by about 17-21%.

  3. Eastern languages DO NOT SUFFER in UTF-8 the way everyone claims that they do! Behold:

  • In Korean and in (simplified) Chinese, you get only 6% bigger in UTF-8 than in UTF-16.
  • In Japanese, you get only 29% bigger in UTF-8 than in UTF-16.
  • The traditional Chinese actually got smaller in UTF-8 than in UTF-16! In fact, it costs 32% to use UTF-16 over UTF-8 for this sample. If you look at the Lines and Words columns, it looks as though this might be due to white space usage.
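
The percentages quoted in these observations can be recomputed from the byte counts in the table. This is a short Python sketch for checking the arithmetic; the byte counts are copied from the table above, while `percent_change` is a helper name introduced here for illustration.

```python
# (language, UTF-16 bytes, UTF-8 bytes), copied from the table above
SIZES = [
    ("English", 86296, 44023),
    ("Hungarian", 38054, 21148),
    ("Greek", 14336, 11848),
    ("Russian", 42936, 36585),
    ("Chinese, simplified", 12934, 13754),
    ("Chinese, traditional", 4382, 3331),
    ("Japanese", 50738, 65636),
    ("Korean", 27940, 29561),
]

def percent_change(new, old):
    """Signed percentage by which `new` differs from `old`, rounded to an integer."""
    return round(100 * (new - old) / old)

for language, utf16, utf8 in SIZES:
    print(f"{language}: UTF-8 is {percent_change(utf8, utf16):+d}% vs UTF-16, "
          f"UTF-16 is {percent_change(utf16, utf8):+d}% vs UTF-8")
```

A positive first number means the UTF-8 file is larger; a positive second number means the UTF-16 file is larger.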

I hope that answers your question. There is simply no +50% to +100% size increase for Eastern languages when encoded in UTF-8 compared to when these same texts are encoded in UTF-16. Only when taking individual code points do you ever see numbers like that, which is a completely unreasonable metric.
