There are the standard A-Z and a-z characters, but there are also hyphens, em dashes, quotes, etc.
Plus, there are all of the international characters, such as letters with umlauts.
So, for an English-based system, what's the complete set? What about sets for other languages? What about UTF-8, UTF-16, etc.?
Bonus question: How many name fields are needed, and what are their maximum lengths?
There are definitely two different types of characters involved in people's names: those that are there as part of the context, and those that are there for structural reasons. I don't want to limit or interfere with the context characters, but I do need to deal with the structural ones.
For example, I had a name come in that was separated by an em dash, but it was hard to distinguish that from the minus character. To make the system easier to search, I want to take all five different types of dashes and map them onto one unique character (minus), so that the searcher doesn't need to know specifically which symbol was initially entered.
The problem exists for dashes, and probably for quotes as well, but how many other symbols does it affect?
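As a rough sketch of the dash mapping described above, one option is to leave the stored name untouched and build a separate, folded key that is used only for searching. The function name `search_key` and the exact lists of dash and quote characters below are assumptions made for illustration; they are not a complete inventory of look-alike symbols.

```python
import unicodedata

# Assumption: an illustrative, non-exhaustive list of dash-like characters.
DASHES = [
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
    "\u2012",  # FIGURE DASH
    "\u2013",  # EN DASH
    "\u2014",  # EM DASH
    "\u2015",  # HORIZONTAL BAR
    "\u2212",  # MINUS SIGN
]

# Assumption: an illustrative list of curly quotes and their ASCII stand-ins.
QUOTES = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',  # RIGHT DOUBLE QUOTATION MARK
}

# Translation table for str.translate: code point -> replacement string.
FOLD_TABLE = {ord(ch): "-" for ch in DASHES}
FOLD_TABLE.update({ord(src): dst for src, dst in QUOTES.items()})


def search_key(name: str) -> str:
    """Return a folded form of `name` for search indexing only.

    The original name should still be stored exactly as entered.
    """
    composed = unicodedata.normalize("NFC", name)  # unify composed/decomposed forms
    folded = composed.translate(FOLD_TABLE)        # fold dashes and quotes to ASCII
    return folded.casefold()                       # case-insensitive comparison


if __name__ == "__main__":
    print(search_key("Smith\u2014Jones"))
    print(search_key("O\u2019Brien"))
```

Applying the same function to the search query means the searcher can type a plain hyphen-minus and still match a name entered with an em dash. If an explicit list feels too narrow, `unicodedata.category(ch) == "Pd"` identifies characters in Unicode's "Dash Punctuation" category, though MINUS SIGN is classed as a math symbol and would still need to be listed separately.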
There's a good article by the W3C called "Personal names around the world" that explains the problems (and possible solutions) pretty well. It was originally a two-part blog post by Richard Ishida: part 1 and part 2.
Personally I'd say: support every printable Unicode character and, to be safe, provide just a single field "name" that contains the full, formatted name. This way you can store pretty much every form of name. You might need more structured storage, but then don't expect to be able to store every single combination in a structured form, as there are simply too many different ones.
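To make the single-field suggestion a bit more concrete, here is a minimal sketch of a permissive check that accepts almost any Unicode text and rejects only characters that are clearly not part of a name. The function name `is_acceptable_name`, the set of rejected categories, and the `max_len` default of 200 are all assumptions chosen for illustration; in particular the length limit is an arbitrary placeholder, not a recommendation.

```python
import unicodedata

# Assumption: Unicode general categories treated as "not printable" here.
# Cc = control, Cs = surrogate, Co = private use, Cn = unassigned,
# Zl = line separator, Zp = paragraph separator.
REJECTED_CATEGORIES = {"Cc", "Cs", "Co", "Cn", "Zl", "Zp"}


def is_acceptable_name(name: str, max_len: int = 200) -> bool:
    """Permissive check for a single free-form "name" field.

    Letters from any script, combining marks, spaces, and punctuation
    are all allowed; only the categories listed above are rejected.
    """
    cleaned = unicodedata.normalize("NFC", name).strip()
    if not cleaned or len(cleaned) > max_len:
        return False
    return all(
        unicodedata.category(ch) not in REJECTED_CATEGORIES for ch in cleaned
    )


if __name__ == "__main__":
    for candidate in ["J\u00fcrgen M\u00fcller", "\u674e\u5c0f\u9f8d", "Bo\x00b", "   "]:
        print(repr(candidate), is_acceptable_name(candidate))
```

Format characters (category Cf) are deliberately not rejected in this sketch, since some scripts rely on zero-width joiners and non-joiners; a stricter system might choose differently.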