Why are the "no-break spaces" ispunct() in GLibC?

Question

I've been looking at the <wctype.h> flags given to various separator characters by GNU LibC. There are two groups, basically.

The first group returns true on iswspace() and iswblank() (and iswprint(), but that is true for the other group as well). These include:

  • U+0020 SPACE
  • U+1680 OGHAM SPACE MARK
  • U+2000 EN QUAD
  • U+2001 EM QUAD
  • U+2002 EN SPACE
  • U+2003 EM SPACE
  • U+2004 THREE-PER-EM SPACE
  • U+2005 FOUR-PER-EM SPACE
  • U+2006 SIX-PER-EM SPACE
  • U+2008 PUNCTUATION SPACE
  • U+2009 THIN SPACE
  • U+200a HAIR SPACE
  • U+205f MEDIUM MATHEMATICAL SPACE
  • U+3000 IDEOGRAPHIC SPACE

No complaints so far. The other group has me puzzled, though:

  • U+00A0 NO-BREAK SPACE
  • U+2007 FIGURE SPACE
  • U+202F NARROW NO-BREAK SPACE

These return false on iswspace() and iswblank(), but true for iswpunct() and iswgraph().

Why are these last three classified as punctuation and not as space?

Java agrees with GLibC on this, apparently (see linked pages). Unicode labels both groups as category 'Zs', "Space_Separator"...

Recommended answer

ISO/IEC 30112 Information technology -- Specification methods for cultural conventions states, emphasis mine:

Define characters to be classified as white-space characters, to find syntactical boundaries. [...] The class should not include the NO-BREAK spaces characters <U00A0>, <U2007>, <UFEFF>, as these characters should not be used for word boundaries.

07-29 14:32