Question
Turkish chars 'ÇçĞğİıÖöŞşÜü' are not handled correctly in UTF-8 encoding although they all seem to be defined. In use, the char code of every one of them is 65533 (the replacement character, possibly for error display), and a question mark or a box is displayed depending on the selected font. In some cases 0/null is returned as the char code. On the internet there are lots of tools which give UTF-8 definitions for them, but I am not sure whether those tools use a defined (real/international) registry or create the definition dynamically with known rules and calculations. Fonts for them are well defined, and there is no problem displaying them when we enter the code points manually. This proves that they are defined in UTF-8. On the other hand, they are not handled in encodings or transformations such as ajax requests/responses.
So the basic question is: "How can we define a code point for a char?" To prevent misunderstanding, the question may be restated as follows. Suppose we have prepared the encoding data for "Ç" like this:

Character: Ç
Character name: LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point: 00C7
Decimal code point: 199
Hex UTF-8 bytes: C387
......

Where/how can we save this info so that it becomes a standard UTF-8 char?
How can we distribute/expose it (make it ready to be used by others)?
Do we need confirmation from anybody/any foundation (like a Unicode/UTF-8 consortium)?
How can we detect/fix errors if the chars are already registered but not working correctly?
Can we have a custom UTF-8 configuration? If yes, how?
Note: No code snippet is needed here, as this is not a misuse problem.
Recommended answer
The characters you mention are present in Unicode. Here are their code points in hexadecimal and how they are encoded in UTF-8:
Char:  Ç      ç      Ğ      ğ      İ      ı      Ö      ö      Ş      ş      Ü      ü
Code:  00c7   00e7   011e   011f   0130   0131   00d6   00f6   015e   015f   00dc   00fc
UTF8:  c3 87  c3 a7  c4 9e  c4 9f  c4 b0  c4 b1  c3 96  c3 b6  c5 9e  c5 9f  c3 9c  c3 bc
This means that if you write, for example, the bytes 0xc4 0x9e into a file, you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
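The round trip described above can be sketched in Java roughly as follows. The class and method names (`Utf8RoundTrip`, `utf8Hex`, `decodeUtf8`) are made up for illustration. The last call shows what Java's decoder does with a byte that is not valid UTF-8 on its own: it substitutes U+FFFD (decimal 65533). That this is what produces the 65533 in the question is only an assumption, since the question does not show the code involved.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    // Encode text as UTF-8 and format the bytes as lowercase hex, e.g. "c4 9e".
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode raw bytes as UTF-8; malformed input is replaced with U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u011e" is Ğ; escapes keep the example independent of the source file encoding.
        System.out.println(utf8Hex("\u011e")); // c4 9e

        String decoded = decodeUtf8(new byte[] {(byte) 0xc4, (byte) 0x9e});
        System.out.println((int) decoded.charAt(0)); // 286, i.e. 0x011e

        // A lone 0xd0 is an incomplete UTF-8 sequence (hypothetical bad input).
        String broken = decodeUtf8(new byte[] {(byte) 0xd0});
        System.out.println((int) broken.charAt(0)); // 65533, i.e. U+FFFD
    }
}
```

If the bytes on the wire match the table above and both sides decode them as UTF-8, the characters should survive unchanged.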
Update: For correct alphabetic order and case conversion in Turkish you have to use a library that understands locales, just as for any other natural language. For example, in Java:
import java.util.Locale;

Locale tr = new Locale("tr", "TR"); // Turkish locale: language code first, then country code
System.out.println("ÇçĞğİıÖöŞşÜü".toUpperCase(tr)); // ÇÇĞĞİIÖÖŞŞÜÜ
System.out.println("ÇçĞğİıÖöŞşÜü".toLowerCase(tr)); // ççğğiıööşşüü
Notice how i in uppercase becomes İ, and I in lowercase becomes ı. You don't say which programming language you use, but its standard library almost certainly supports locales, too.
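To make the dotted/dotless i behaviour concrete, here is a small sketch that converts the plain ASCII letters under the Turkish locale and under `Locale.ROOT`. The class and method names are illustrative only; `Locale.forLanguageTag("tr-TR")` is used as an alternative way of obtaining the Turkish locale.

```java
import java.util.Locale;

final class TurkishCase {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Uppercase using Turkish rules: i -> İ (U+0130).
    static String upperTurkish(String s) {
        return s.toUpperCase(TURKISH);
    }

    // Lowercase using Turkish rules: I -> ı (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        System.out.println(upperTurkish("i").equals("\u0130"));          // true: i -> İ
        System.out.println(lowerTurkish("I").equals("\u0131"));          // true: I -> ı
        System.out.println("i".toUpperCase(Locale.ROOT).equals("I"));    // true: locale-neutral rules
        System.out.println("I".toLowerCase(Locale.ROOT).equals("i"));    // true: locale-neutral rules
    }
}
```

The same string therefore gives different results depending on the locale passed in, which is why the locale should be stated explicitly rather than left to the platform default.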
Unicode defines the code point and certain properties for each character (for example, whether it is a digit or a letter and, for a letter, whether it is uppercase, lowercase, or titlecase), as well as certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, like the Institute for the Languages of Finland in Finland and the Real Academia Española in Spain, independently of Unicode.
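As a rough illustration of those per-character properties, the sketch below queries them through `java.lang.Character`, which exposes the Unicode data bundled with the JDK. The class and method names are hypothetical, and which Unicode version is reflected depends on the Java release.

```java
final class CharProperties {
    // Summarize a few Unicode properties of a code point as Java reports them.
    static String describe(int codePoint) {
        return String.format("U+%04X %s letter=%b upper=%b lower=%b digit=%b",
                codePoint,
                Character.getName(codePoint),
                Character.isLetter(codePoint),
                Character.isUpperCase(codePoint),
                Character.isLowerCase(codePoint),
                Character.isDigit(codePoint));
    }

    // True if the character has strong right-to-left directionality.
    static boolean isRightToLeft(int codePoint) {
        return Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT;
    }

    public static void main(String[] args) {
        // U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA letter=true upper=true lower=false digit=false
        System.out.println(describe(0x00C7));
        System.out.println(isRightToLeft(0x05D0)); // true: HEBREW LETTER ALEF
        System.out.println(isRightToLeft(0x00C7)); // false: Latin letters run left to right
    }
}
```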
Update 2:
The test ((ch&0x20)==ch) for lower case is broken for most languages in the world, not just Turkish. So is the algorithm you mention for converting upper case to lower case. The test for being a letter is also incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions written by people who know what they are doing.
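A small comparison may help here. The sketch below puts the quoted test next to the common ASCII bit trick `(ch & 0x20) != 0` and next to `Character.isLowerCase`. That the bit trick is what the quoted test was meant to be is an assumption; the method names are invented for the example.

```java
final class LowerCaseTest {
    // The test exactly as quoted above, taken literally.
    static boolean quotedTest(char ch) {
        return (ch & 0x20) == ch;
    }

    // Assumed intent: the ASCII trick that checks bit 0x20.
    static boolean asciiBitTest(char ch) {
        return (ch & 0x20) != 0;
    }

    // Library test based on Unicode character properties.
    static boolean libraryTest(char ch) {
        return Character.isLowerCase(ch);
    }

    public static void main(String[] args) {
        char lowerG = '\u011f'; // ğ, a lowercase letter
        char upperI = '\u0130'; // İ, an uppercase letter

        System.out.println(quotedTest('a'));      // false: taken literally, it rejects even ASCII 'a'
        System.out.println(asciiBitTest(lowerG)); // false: wrong, ğ is lowercase
        System.out.println(asciiBitTest(upperI)); // true: wrong, İ is uppercase
        System.out.println(libraryTest(lowerG));  // true
        System.out.println(libraryTest(upperI));  // false
    }
}
```

The bit trick only works because of how ASCII happens to lay out A-Z and a-z; nothing guarantees that layout for other letters.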
Unicode is supposed to be universal. Creating national and language-specific variants of encodings is what led us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example, in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard, and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
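Locale-dependent ordering of this kind is usually delegated to a collation library. A minimal Java sketch using `java.text.Collator` follows. It assumes the JDK in use ships Swedish collation rules (standard JDK builds normally do; a trimmed runtime image without the `jdk.localedata` module may fall back to default rules), and the class and method names are illustrative.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class LocaleSort {
    // Return a sorted copy of the words using the collation rules of the given locale.
    static String[] sorted(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00e4"; // ä

        // English rules treat ä as a variant of a, so it sorts before z.
        System.out.println(Arrays.toString(sorted(Locale.ENGLISH, "z", aUmlaut, "a")));

        // Swedish rules treat ä as a separate letter that comes after z.
        System.out.println(Arrays.toString(
                sorted(Locale.forLanguageTag("sv-SE"), "z", aUmlaut, "a")));
    }
}
```

The same three strings come out in a different order for each locale, which is the point made above: no single ordering of code points can be right for every language.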