Question
There was an earlier question about a CamelCase regex. Combining it with tchrist's post, I'm wondering what the correct UTF-8 CamelCase regex is.
Starting with brian d foy's regex:
/
\b # start at word boundary
[A-Z] # start with upper
[a-zA-Z]* # followed by any alpha
(?: # non-capturing grouping for alternation precedence
[a-z][a-zA-Z]*[A-Z] # next bit is lower, any zero or more, ending with upper
| # or
[A-Z][a-zA-Z]*[a-z] # next bit is upper, any zero or more, ending with lower
)
[a-zA-Z]* # anything that's left
\b # end at word
/x
and modifying it to:
/
\b # start at word boundary
\p{Uppercase_Letter} # start with upper
\p{Alphabetic}* # followed by any alpha
(?: # non-capturing grouping for alternation precedence
\p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter} ### next bit is lower, any zero or more, ending with upper
| # or
\p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter} ### next bit is upper, any zero or more, ending with lower
)
\p{Alphabetic}* # anything that's left
\b # end at word
/x
I have a problem with the lines marked '###'.
In addition, how should the regex be modified if digits and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?
Updated (following ysth's comment):
In what follows, "any" means "uppercase or lowercase or number or underscore".
The regex should match CamelWord and CaW:
- starts with an uppercase letter
- optional any
- a lowercase letter, number, or underscore
- optional any
- an uppercase letter
- optional any
Please do not mark this as a duplicate, because it is not. The original question (and its answers) considered only ASCII.
Recommended answer
I really can't tell what you're trying to do, but this should be closer to what your original intent seems to have been. I still can't tell what you mean to do with it, though.
m{
\b
\p{Upper} # start with uppercase code point (NOT LETTER)
\w* # optional ident chars
# note that upper and lower are not related to letters
(?: \p{Lower} \w* \p{Upper}
| \p{Upper} \w* \p{Lower}
)
\w*
\b
}x
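To see how this pattern behaves, here is a minimal sketch that runs it over a few sample words. The word list and the `$camel` variable name are assumptions made for illustration; the pattern itself is the one above, only stored with `qr//`.

```perl
use strict;
use warnings;
use utf8;                                # the source contains a non-ASCII literal
binmode(STDOUT, ':encoding(UTF-8)');

# The pattern from the answer, unchanged apart from being stored with qr//.
my $camel = qr{
    \b
    \p{Upper} \w*
    (?: \p{Lower} \w* \p{Upper}
      | \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Hypothetical sample words, chosen only for illustration.
my @words = ('CamelWord', 'CaW', 'ÉcoleNormale', 'Camel', 'camelWord', 'W2X3');

for my $word (@words) {
    # Anchor with \A and \z so the whole word has to match.
    my $result = $word =~ /\A$camel\z/ ? 'match' : 'no match';
    print "$word: $result\n";
}
```

With this pattern CamelWord, CaW and ÉcoleNormale match, while W2X3 does not, because a digit is neither \p{Upper} nor \p{Lower}.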
Never use [a-z]. And in fact, don't use \p{Lowercase_Letter} or \p{Ll}, since those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
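The difference shows up on code points that carry the Lowercase property without having General_Category Ll. The following sketch compares the two properties on three code points; the sample list is my own choice for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Sample code points picked for illustration: one ordinary lowercase letter,
# one letter-number (Nl) and one symbol (So) that still count as Lowercase.
my @samples = (
    [ 'LATIN SMALL LETTER A',         "\x{0061}" ],
    [ 'SMALL ROMAN NUMERAL ONE',      "\x{2170}" ],
    [ 'CIRCLED LATIN SMALL LETTER A', "\x{24D0}" ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    my $is_ll    = $char =~ /\p{Lowercase_Letter}/ ? 'yes' : 'no';
    my $is_lower = $char =~ /\p{Lowercase}/        ? 'yes' : 'no';
    printf "%-30s Ll=%-3s Lowercase=%s\n", $name, $is_ll, $is_lower;
}
```

Only the first sample is matched by \p{Lowercase_Letter}, whereas all three are matched by \p{Lowercase}.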
Remember that \w is really just
[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
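For the variant asked about in the question, where digits and the underscore count as lowercase so that W2X3 is valid, one possible sketch follows the question's own rule list: an uppercase start, then somewhere a lowercase-like character, then somewhere an uppercase one. Treating "lowercase-like" as \p{Lower}, \p{Nd} or \p{Pc} is my assumption, as are the `$camel_loose` name and the sample words; this is not taken from the original answer.

```perl
use strict;
use warnings;

# Assumption: "lowercase-like" means Lowercase, a decimal digit, or
# connector punctuation (which includes the ASCII underscore).
my $camel_loose = qr{
    \b
    \p{Upper}                  # starts with an uppercase code point
    \w*                        # optional "any"
    [\p{Lower}\p{Nd}\p{Pc}]    # lowercase, digit, or underscore-like
    \w*                        # optional "any"
    \p{Upper}                  # a later uppercase code point
    \w*                        # optional "any"
    \b
}x;

# Hypothetical sample words, chosen only for illustration.
for my $word (qw(CamelWord CaW W2X3 Camel CAMEL)) {
    my $result = $word =~ /\A$camel_loose\z/ ? 'match' : 'no match';
    print "$word: $result\n";
}
```

Here CamelWord, CaW and W2X3 match, while Camel and CAMEL do not. Because digits and connector punctuation are already part of \w, only the middle character class had to change.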