How do I associate Unicode blocks with languages/scripts?

Question

I am trying to find a resource that can be used to connect languages (or, more probably, scripts) to blocks of Unicode characters. Such a resource would be used to look up answers to questions such as "What Unicode blocks are used in French?" or "What languages use the block from 0A80-0AFF (http://unicodinator.com/#Block-Gujarati)?" Do you know of such a resource?

I would have expected to be able to find this information easily at unicode.org. I was quickly able to find a great table that relates country codes to languages (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html). But I've spent quite a bit of time poking around with no luck finding something that relates Unicode blocks to languages. It's possible I've got a terminology issue blocking me from connecting the dots here...

I am not picky about exactly what is meant by "language" (Java Locale code or ISO 639 code or whatever) in this case. I also understand that there may not be exact answers because, for instance, an Arabic document can contain Latin and other text in addition to characters from the Arabic blocks (http://unicodinator.com/#Block-Arabic, http://unicodinator.com/#Block-Arabic_Supplement). But surely there must be some table that says "these languages go with these blocks"... I'm also not picky about the format (XML, CSV, whatever); I can easily transform it into data I can use for my application. And again, I do realize the reference would probably connect scripts to blocks, not languages (though scripts can be mapped to languages).

I do realize this will be a many-to-many table (since many languages use characters from multiple blocks, and many blocks are used by multiple languages). I also realize this cannot be answered precisely, since Unicode code points are not language specific. However, neither can the question "what languages are there in this country?" (the answer is probably "most of them" for most countries), yet a table like this (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html) is still possible to create, meaningful and useful.

As to why I'd want such a thing: I would like to enhance http://unicodinator.com with global heat-maps for the code blocks, and lists of languages; I also have a game concept I am tinkering with. Beyond that, there are probably many other uses other people could have for this (font creation? heuristic, quick, best-guess language detection now that the Google Translate API is going away? research projects?).

Recommended answer

I got an answer from Unicode.org themselves! In the CLDR subproject, there is a document for each language id, which you can search for "exemplarCharacters":

<exemplarCharacters>[\u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 ء آ أ ؤ إ ئ ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D\u200E\u200F]</exemplarCharacters>
<exemplarCharacters type="currencySymbol" draft="contributed">[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>
<exemplarCharacters type="index" draft="contributed">[ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي]</exemplarCharacters>
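As a rough illustration of how such entries could be read programmatically, here is a minimal Python sketch. It assumes the exemplar sets use only the simple form shown above (single characters, optional spaces, and \uXXXX escapes); it does not handle ranges such as a-z or multi-character sequences in braces. The SAMPLE string, the wrapping <characters> element, and the function name parse_exemplars are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the excerpt above.
SAMPLE = r"""<characters>
<exemplarCharacters>[\u064B \u064C \u0627 \u0628 \u062A]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D\u200E\u200F]</exemplarCharacters>
</characters>"""

# Matches a literal backslash-u escape followed by four hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def parse_exemplars(xml_text):
    """Return {type: set of characters} for each exemplarCharacters element.

    Elements without a type attribute are stored under the key "main"
    (a naming choice made for this sketch).
    """
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter("exemplarCharacters"):
        kind = elem.get("type", "main")
        body = (elem.text or "").strip()
        # Drop the surrounding square brackets of the set notation.
        if body.startswith("[") and body.endswith("]"):
            body = body[1:-1]
        # Turn \uXXXX escapes into the characters they stand for.
        body = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
        # Whitespace only separates items, so it is discarded.
        result[kind] = {ch for ch in body if not ch.isspace()}
    return result


if __name__ == "__main__":
    for kind, chars in parse_exemplars(SAMPLE).items():
        print(kind, sorted(f"U+{ord(c):04X}" for c in chars))
```

Keeping the result keyed by type makes it easy to decide later whether auxiliary or index characters should count toward a language's blocks.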

Alternatively, this page appears to list all of them: http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters.html. I will work on reshuffling this data into a langid -> blockid map of some kind, at which point I will probably award @borrible the "Answer" (rather than make mine the answer).
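One possible shape for that langid -> blockid map, sketched in Python under a few assumptions: the BLOCKS table is a hand-copied subset of block ranges (the complete list is in Blocks.txt in the Unicode Character Database), the EXEMPLARS sets are tiny hand-picked samples rather than real CLDR data, and all function names are invented for this example.

```python
from bisect import bisect_right
from collections import defaultdict

# Partial block table (start, end, name), sorted by start code point.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0A80, 0x0AFF, "Gujarati"),
    (0x2000, 0x206F, "General Punctuation"),
]
STARTS = [start for start, _end, _name in BLOCKS]


def block_of(ch):
    """Return the block name for one character, or None if not in BLOCKS."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        _start, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return None


def blocks_by_language(exemplars):
    """Map each language id to the set of blocks its characters fall in."""
    return {
        lang: {b for b in map(block_of, chars) if b is not None}
        for lang, chars in exemplars.items()
    }


def languages_by_block(lang_blocks):
    """Invert the mapping: block name -> set of language ids."""
    inverted = defaultdict(set)
    for lang, blocks in lang_blocks.items():
        for block in blocks:
            inverted[block].add(lang)
    return dict(inverted)


# Illustrative sample only; real sets would come from exemplarCharacters.
EXEMPLARS = {
    "fr": set("abc\u00E0\u00E7\u00E9"),
    "ar": set("\u0627\u0628\u062A\u064B"),
    "gu": set("\u0A85\u0A86\u0A95"),
}

if __name__ == "__main__":
    by_lang = blocks_by_language(EXEMPLARS)
    for lang in sorted(by_lang):
        print(lang, sorted(by_lang[lang]))
    by_block = languages_by_block(by_lang)
    for block in sorted(by_block):
        print(block, sorted(by_block[block]))
```

Because the relation is many-to-many, both directions are kept as sets: the first answers "which blocks does French use?" and the inverted one answers "which languages use the Gujarati block?".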

08-05 21:05