Problem description
I've been looking at the <wctype.h> flags given to various separator characters by GNU LibC. There are two groups, basically.
The first group returns true on iswspace() and iswblank() (and isprint(), but that is true for the other group as well). These include:
- U+0020 SPACE
- U+1680 OGHAM SPACE MARK
- U+2000 EN QUAD
- U+2001 EM QUAD
- U+2002 EN SPACE
- U+2003 EM SPACE
- U+2004 THREE-PER-EM SPACE
- U+2005 FOUR-PER-EM SPACE
- U+2006 SIX-PER-EM SPACE
- U+2008 PUNCTUATION SPACE
- U+2009 THIN SPACE
- U+200A HAIR SPACE
- U+205F MEDIUM MATHEMATICAL SPACE
- U+3000 IDEOGRAPHIC SPACE
No complaints so far. The other group has me puzzled, though:
These return false on iswspace() and iswblank(), but true for iswpunct() and iswgraph().
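The classification described above can be checked directly. The following is a minimal sketch, not a definitive test: the flag names, wc_classes() and print_class_table() are hypothetical helpers, the sample code points are taken from the list above and from the standard quoted in the answer, and it assumes a platform (such as glibc) where wchar_t values are Unicode code points.

```c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Bit flags for the character classes discussed in the question
   (hypothetical names, introduced only for this sketch). */
enum {
    WC_SPACE = 1 << 0,
    WC_BLANK = 1 << 1,
    WC_PUNCT = 1 << 2,
    WC_GRAPH = 1 << 3,
    WC_PRINT = 1 << 4
};

/* Collect the <wctype.h> results for one wide character into a bitmask.
   The results depend on the LC_CTYPE category of the current locale. */
unsigned wc_classes(wint_t wc)
{
    unsigned mask = 0;
    if (iswspace(wc)) mask |= WC_SPACE;
    if (iswblank(wc)) mask |= WC_BLANK;
    if (iswpunct(wc)) mask |= WC_PUNCT;
    if (iswgraph(wc)) mask |= WC_GRAPH;
    if (iswprint(wc)) mask |= WC_PRINT;
    return mask;
}

/* Print one row per sample code point. Assumes wchar_t holds
   Unicode code points, as it does on glibc. */
void print_class_table(void)
{
    static const wint_t samples[] = {
        0x0020, /* SPACE */
        0x2003, /* EM SPACE */
        0x3000, /* IDEOGRAPHIC SPACE */
        0x00A0, /* NO-BREAK SPACE */
        0x2007  /* FIGURE SPACE */
    };
    size_t i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        unsigned m = wc_classes(samples[i]);
        printf("U+%04lX space=%d blank=%d punct=%d graph=%d print=%d\n",
               (unsigned long)samples[i],
               !!(m & WC_SPACE), !!(m & WC_BLANK), !!(m & WC_PUNCT),
               !!(m & WC_GRAPH), !!(m & WC_PRINT));
    }
}
```

To see locale-specific results, call setlocale(LC_ALL, "") from <locale.h> in main() before print_class_table(), with a UTF-8 locale selected in the environment. In the default "C" locale the answers for non-ASCII code points may differ.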
Why are the last three punctuation rather than spaces?
Java agrees with GLibC on this, apparently (see linked pages). Unicode labels both groups as category 'Zs', "Space_Separator"...
Recommended answer
ISO/IEC 30112 Information technology -- Specification methods for cultural conventions states, emphasis mine:
Define characters to be classified as white-space characters, to find syntactical boundaries. [...] The class should not include the NO-BREAK spaces characters <U00A0>, <U2007>, <UFEFF>, as these characters should not be used for word boundaries.
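The practical effect of that rule can be illustrated with a word splitter built on iswspace(). This is only a sketch: count_words() is a hypothetical helper, not part of any library, and it simply treats every character for which iswspace() returns true as a word boundary.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Count words in a wide string, treating every character for which
   iswspace() is true as a boundary. A character that the locale does
   not classify as white space (for example a no-break space under
   the rule quoted above) never ends a word. */
size_t count_words(const wchar_t *s)
{
    size_t words = 0;
    int in_word = 0;

    for (; *s != L'\0'; ++s) {
        if (iswspace((wint_t)*s)) {
            in_word = 0;          /* boundary: the current word ends */
        } else if (!in_word) {
            in_word = 1;          /* first character of a new word */
            ++words;
        }
    }
    return words;
}
```

With a locale that follows the quoted rule, a string such as "10", U+00A0, "km" would be counted as a single word, which is the point of a no-break space: the two parts are meant to stay together.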