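Question
My understanding is that the name of the character set is Unicode, and that UTF-8 is the name of one particular encoding of the Unicode character set, but I often see the terms "encoding" and "charset" used interchangeably when referring to UTF-8. For example,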
<meta charset="UTF-8">
vs
<?xml version="1.0" encoding="UTF-8"?>
UTF-8 is an encoding, and that is the term used in RFC 3629, the RFC that defines it.
Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that could only represent the characters of that alphabet. Thus, the terms "encoding" and "charset" were often conflated, but they mean different things.
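A quick way to see what a single-alphabet encoding looks like in practice is to try a couple of legacy codecs from Python's standard library. This is only an illustrative sketch: `can_encode` is a helper name made up here, and ISO 8859-5 and ISO 8859-7 are picked as examples of a Cyrillic and a Greek encoding.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


cyrillic, greek = "Ж", "Ω"

# ISO 8859-5 has Cyrillic letters but no Greek ones; ISO 8859-7 is the reverse.
print(can_encode(cyrillic, "iso8859_5"), can_encode(greek, "iso8859_5"))
print(can_encode(cyrillic, "iso8859_7"), can_encode(greek, "iso8859_7"))

# The same byte value stands for a different character under each encoding.
byte = cyrillic.encode("iso8859_5")
print(byte, byte.decode("iso8859_5"), byte.decode("iso8859_7"))
```

With encodings like these, choosing the encoding also fixed which characters you could write at all, which is probably why the two terms blurred together.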
Now though, Unicode is usually the only character set you need to worry about since it contains characters for most written languages you'll have to deal with, except Klingon.
† - Alphabet, a kind of character set where characters correspond directly to sounds in a spoken language.
A character set is a mapping from integer codes to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set that assigns 21-bit integers, called code points, to characters. The Unicode Consortium's glossary has its own description of the term.
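To make the character-set idea concrete, here is a small Python sketch (my own illustration, not taken from the glossary). It relies on the built-in `ord` and `chr` and the standard `unicodedata` module; `describe` is just a helper name chosen for this example.

```python
import sys
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' using its Unicode code point."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# ord() gives the integer (code point) that Unicode assigns to a character.
for ch in "AЖΩ€":
    print(ch, describe(ch))

# chr() goes the other way: from the integer back to the character.
print(chr(0x416))

# The largest code point is U+10FFFF, which needs 21 bits.
print(hex(sys.maxunicode), sys.maxunicode.bit_length())
```

Note that nothing here involves bytes yet: the character set only says which integer belongs to which character.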
An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps strings of bytes (8-bit integers) to strings of code points (21-bit integers). The Unicode Consortium calls it a "character encoding scheme", and it is defined in RFC 3629.
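The byte-string to code-point-string mapping can be sketched in a few lines of Python. The sample text is arbitrary and the variable names are only for illustration; `str.encode` and `bytes.decode` are the standard library calls.

```python
text = "AЖ€😀"

# Encoding: code points -> bytes. Each code point becomes 1 to 4 bytes.
encoded = text.encode("utf-8")
for ch in text:
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex()}")

# Decoding: bytes -> code points, the direction described above.
decoded = encoded.decode("utf-8")
code_points = [ord(ch) for ch in decoded]
print(len(encoded), "bytes ->", len(code_points), "code points")
```

The four characters take 1, 2, 3, and 4 bytes respectively, so the byte string and the code-point string have different lengths even though they carry the same text.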