Python: split a string by all whitespace characters

Problem description

To split strings by spaces in Python, one usually uses the string's split method without parameters:

>>> 'a\tb c\nd'.split()
['a', 'b', 'c', 'd']

But yesterday I ran across a string that used ZERO WIDTH SPACE between words as well. Having turned my new knowledge into a short black magic performance (among JavaScript folks), I would like to ask how best to split by all whitespace characters, since split alone is not enough:

>>> u'a\u200bc d'.split()
[u'a\u200bc', u'd']

UPD1

It seems the solution suggested by sth generally works but depends on some OS settings or Python compilation options. It would be nice to know the reason for sure (and whether the setting can be switched on in Windows).

UPD2

cptphil found a great link that makes everything clear:

A quotation from unicode site:

The change was then reflected in Python. The result of u'\u200B'.isspace() in Python 2.5.4 and 2.6.5 is True, in Python 2.7.1 it is already False.
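The difference between the two characters can be inspected directly. The following is a minimal sketch for Python 3 (the question itself uses Python 2 literals); the helper name describe is hypothetical and only wraps standard unicodedata and str calls:

```python
import unicodedata


def describe(ch):
    """Return (Unicode name, general category, str.isspace() result) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())


# HAIR SPACE is in category Zs (space separator), so split() treats it as whitespace.
print(describe("\u200a"))
# ZERO WIDTH SPACE is in category Cf (format), so split() leaves it alone.
print(describe("\u200b"))
```

On a current Python 3 the second call reports that U+200B is not a space, which matches the 2.7.1 behaviour described above.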

For other space characters regular split is enough:

>>> u'a\u200Ac'.split()
[u'a', u'c']

And if that is not enough for you, add characters one by one as Gabi Purcaru suggests below.
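One way to "add characters one by one" without a regular expression is to map each extra character to an ordinary space and then call split() as usual. This is only a sketch in Python 3; the names EXTRA_SEPARATORS and split_with_extras are made up for illustration, and the set of extra characters is an assumption you would adjust to your data:

```python
# Characters that are not Unicode whitespace but should still separate words.
# Assumption: only ZERO WIDTH SPACE is needed; extend the string as required.
EXTRA_SEPARATORS = "\u200b"


def split_with_extras(text, extras=EXTRA_SEPARATORS):
    """Split on regular whitespace plus every character listed in `extras`."""
    # Replace each extra separator with a plain space, then use the normal split().
    table = str.maketrans({ch: " " for ch in extras})
    return text.translate(table).split()


print(split_with_extras("a\u200bc d"))
```

Because the final step is a plain split(), leading and trailing separators do not produce empty strings.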

Solution

Edit

It turns out that \u200b is not technically defined as whitespace, so Python does not recognize it as matching \s even with the unicode flag on. It must therefore be treated as a non-whitespace character and listed in the pattern explicitly.

http://en.wikipedia.org/wiki/Whitespace_character#Unicode

http://bugs.python.org/issue13391

import re

# Python 2 syntax: ur"" is a raw unicode literal, which still expands \u200b.
re.split(ur"[\u200b\s]+", u"some string", flags=re.UNICODE)
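The ur"" prefix above is Python 2 only. A rough Python 3 equivalent might look like the sketch below; the names PATTERN and split_all are hypothetical. It assumes Python 3.3 or later, where the re module itself understands \u escapes and str patterns match Unicode whitespace by default, so no flag is passed:

```python
import re

# \s covers Unicode whitespace for str patterns; \u200b is added explicitly
# because ZERO WIDTH SPACE is not classified as whitespace.
PATTERN = re.compile(r"[\u200b\s]+")


def split_all(text):
    """Split on whitespace and ZERO WIDTH SPACE, dropping empty pieces."""
    # Unlike str.split(), re.split() returns empty strings when the text
    # starts or ends with a separator, so they are filtered out here.
    return [part for part in PATTERN.split(text) if part]


print(split_all("a\u200bc d"))
```

Filtering the empty pieces is a design choice made here to mimic the behaviour of split() without arguments; drop the filter if the empty fields matter.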
