Remove all punctuation from a string unless it is between digits

Question

I have a text that contains words and numbers. Here is a representative example:

string = "This is a 1example of the text. But, it only is 2.5 percent of all data"

I want to convert it to something like:

"This is a  1 example of the text But it only is  2.5  percent of all data"

So I want to remove punctuation (a period, a comma, or anything else in string.punctuation) and also put a space between digits and words when they are run together, while keeping floats such as 2.5 intact.

I used the following code:

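import re
import string

item = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Put a space before every capital letter, then collapse runs of whitespace
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# This is a start but not there yet!
#item = ' '.join([x.strip(string.punctuation) for x in item.split() if x not in string.digits])
# Split on every run of digits (this also splits 2.5 into 2, ., 5)
item = ' '.join(re.split(r'(\d+)', item))
print(item)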

The result is:

 >> "This is a  1 example of the text. But, it only is  2 . 5  percent of all data"

I'm almost there but can't figure out that last piece.

Recommended answer

You can use regex lookarounds like this:

(?<!\d)[.,;:](?!\d)

The idea is to put the punctuation you want to remove in a character class and wrap it in negative lookarounds, so that it only matches when there is no digit immediately before it and no digit immediately after it. The dot in 2.5 has a digit next to it, so it is left alone.
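As a quick check of what the lookarounds do at the edges, here is a small sketch on a few made-up inputs. The helper names strip_loose and strip_strict, the sample strings, and the stricter pattern are assumptions added for illustration; only the first pattern comes from the answer.

```python
import re

# Pattern from the answer: matches punctuation with no digit on either side.
LOOSE = r"(?<!\d)[.,;:](?!\d)"
# Hypothetical stricter variant: matches punctuation unless it has a digit
# on BOTH sides, i.e. unless it is literally "between digits".
STRICT = r"(?<!\d)[.,;:]|[.,;:](?!\d)"

def strip_loose(text):
    return re.sub(LOOSE, "", text)

def strip_strict(text):
    return re.sub(STRICT, "", text)

for sample in ["2.5 percent", "1,000 items", "wait, what", "costs 5."]:
    print(repr(sample), "->", repr(strip_loose(sample)), "|", repr(strip_strict(sample)))
```

The two patterns agree on the first three samples. They differ on the last one: the answer's pattern keeps the trailing period in "costs 5." because a digit precedes it, while the stricter variant removes it.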

import re

# Match . , ; : only when no digit sits directly before or after it
regex = r"(?<!\d)[.,;:](?!\d)"

test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"

result = re.sub(regex, "", test_str)
print(result)

The result is:

This is a 1example of the text But it only is 2.5 percent of all data
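Note that this result still contains 1example, and the character class only covers four punctuation marks, whereas the question also asked for a space between digits and words and mentioned string.punctuation. One possible way to cover both is sketched below. The function name clean and the second regex are assumptions for illustration, not part of the original answer.

```python
import re
import string

# Character class containing every character in string.punctuation,
# escaped so that ], ^, - and \ are treated literally.
PUNCT = "[" + re.escape(string.punctuation) + "]"

def clean(text):
    # 1. Same lookaround idea as the answer: drop punctuation that has
    #    no digit directly before it and no digit directly after it.
    text = re.sub(r"(?<!\d)" + PUNCT + r"(?!\d)", "", text)
    # 2. Insert a space at every digit/letter boundary (zero-width match).
    text = re.sub(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)", " ", text)
    # 3. Collapse any repeated whitespace into single spaces.
    return " ".join(text.split())

print(clean("This is a 1example of the text. But, it only is 2.5 percent of all data"))
```

Unlike the desired output shown in the question, this version uses single spaces around the numbers, because step 3 normalizes whitespace. Drop that step if the doubled spaces matter.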

