I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its Unicode code point, no?
S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?
You're mixing decimal (8211) and hexadecimal (0x8211). \x and \u both expect a hexadecimal number, so you'd need \u2013 to match the en dash (decimal 8211) and \u2014 to match the em dash (decimal 8212), not \u8211 or \u8212. Likewise, the normal hyphen (decimal 45) is \x2D, not \x45.
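If you want to double-check the conversion, a small sketch like the following prints the hexadecimal escape for each decimal code point from the debug output. The class and method names are made up for illustration, and it assumes code points in the Basic Multilingual Plane (four hex digits).

```java
public class DashCodePoints {
    // Formats a decimal code point as the four-digit hex escape
    // that Java regular expressions expect after \\u.
    public static String toUnicodeEscape(int codePoint) {
        return String.format("\\u%04X", codePoint);
    }

    public static void main(String[] args) {
        // 45 is the ASCII hyphen, 8211 the en dash, 8212 the em dash.
        int[] decimalValues = {45, 8211, 8212};
        for (int value : decimalValues) {
            System.out.println(value + " -> " + toUnicodeEscape(value));
        }
    }
}
```

For the dash in the sample input (8211) this gives \u2013, which is the value the pattern would need.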
But why not simply use the Unicode property "Dash punctuation"?
As a Java string: "\\s\\p{Pd}\\s"
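A minimal sketch of how that could look with the split() call from the question. The class name, method name, and sample strings are assumptions for illustration; only the pattern itself comes from the answer above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class TitleSplitter {
    // \p{Pd} matches any character in the Unicode "Dash punctuation"
    // category, which includes the ASCII hyphen, en dash and em dash.
    private static final Pattern TITLE_SEGMENT_SEPARATOR =
            Pattern.compile("\\s\\p{Pd}\\s");

    // Splits a title on a dash that has whitespace on both sides.
    public static String[] splitTitle(String sectionTitle) {
        return TITLE_SEGMENT_SEPARATOR.split(sectionTitle);
    }

    public static void main(String[] args) {
        // The sample title from the question, with an en dash (U+2013).
        String sectionTitle = "Study Summary (1 of 10) \u2013 Competition";
        System.out.println(Arrays.toString(splitTitle(sectionTitle)));
    }
}
```

Because the pattern still requires whitespace on both sides, a hyphen inside a word (for example "well-known") is not treated as a separator.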