Problem description
To split a string on whitespace in Python, one usually calls the string's split method with no arguments:
>>> 'a\tb c\nd'.split()
['a', 'b', 'c', 'd']
But yesterday I ran across a string that also used ZERO WIDTH SPACE between words. Having turned my new knowledge into a short black-magic performance (among JavaScript folks), I would like to ask how best to split on all whitespace characters, since split alone is not enough:
>>> u'a\u200bc d'.split()
[u'a\u200bc', u'd']
UPD1
It seems the solution suggested by sth generally works but depends on some OS settings or Python compilation options. It would be nice to know the reason for sure (and whether the setting can be switched on in Windows).
UPD2
cptphil found a great link that makes everything clear:
A quotation from the Unicode site:
The change was then reflected in Python. The result of u'\u200B'.isspace() in Python 2.5.4 and 2.6.5 is True, while in Python 2.7.1 it is already False.
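To check how a given interpreter classifies these characters, a small sketch like the following may help. It assumes Python 3 (the transcripts in this question are Python 2), and the helper name describe is made up for illustration. It reports whether str.isspace() accepts a character, along with its Unicode general category.

```python
import unicodedata


def describe(ch):
    """Return (isspace result, general category, Unicode name) for one character."""
    return ch.isspace(), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>")


# Compare ZERO WIDTH SPACE with two characters that str.split() does treat as whitespace.
for ch in ("\u200b", "\u200a", " "):
    print("U+%04X" % ord(ch), describe(ch))
```

On a current Python 3, ZERO WIDTH SPACE should come back with category Cf (format) rather than Zs (space separator), which is why split() leaves it alone.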
For other space characters, the regular split is enough:
>>> u'a\u200Ac'.split()
[u'a', u'c']
And if that is not enough for you, add characters one by one, as Gabi Purcaru suggests below.
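One possible reading of "adding characters one by one", not necessarily what that answer does, is to map each extra separator to an ordinary space and then let split() do the rest. The sketch below assumes Python 3; EXTRA_SEPARATORS and split_with_extra are hypothetical names chosen for this example.

```python
# Characters that split() ignores but that should still act as separators.
# Extend this string one character at a time as new cases turn up.
EXTRA_SEPARATORS = "\u200b"  # ZERO WIDTH SPACE


def split_with_extra(text, extra=EXTRA_SEPARATORS):
    """Split on regular whitespace plus every character listed in `extra`."""
    # Translate each extra separator to a plain space, then use the normal split().
    table = {ord(ch): " " for ch in extra}
    return text.translate(table).split()


print(split_with_extra("a\u200bc d"))
```

Because the final step is the argument-less split(), leading, trailing and repeated separators still collapse the usual way.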
Edit
It turns out that \u200b is not technically defined as whitespace, so Python does not match it with \s even when the Unicode flag is on. It must therefore be treated as a non-whitespace character and listed in the pattern explicitly.
http://en.wikipedia.org/wiki/Whitespace_character#Unicode
http://bugs.python.org/issue13391
import re
# Python 2 syntax: the ur"" prefix is a raw unicode literal
re.split(ur"[\u200b\s]+", u"some string", flags=re.UNICODE)
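The ur"" prefix only exists in Python 2. A rough Python 3 equivalent might look like the sketch below; the function name split_whitespace_and_zwsp is made up for this example. In Python 3, str patterns are Unicode-aware by default, so the re.UNICODE flag is not needed.

```python
import re

# Character class: ZERO WIDTH SPACE plus everything \s matches.
# A normal (non-raw) string is used so Python itself resolves \u200b.
_SEPARATORS = re.compile("[\u200b\\s]+")


def split_whitespace_and_zwsp(text):
    """Split on runs of whitespace and/or ZERO WIDTH SPACE."""
    # re.split yields empty strings for leading/trailing separators; drop them
    # so the result mirrors what str.split() would return.
    return [part for part in _SEPARATORS.split(text) if part]


print(split_whitespace_and_zwsp("a\u200bc d"))
```

Filtering out the empty strings is a design choice made here so that the result behaves like split() with no arguments. Drop the filter if empty fields matter.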