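This seems obvious, but I couldn't find anything similar. I want to split some text, and I want the pattern that triggers the split to stay attached to the end of the piece that precedes it.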
import re
some_text = "Hi there. It's a nice weather. Have a great day."
pattern = re.compile(r'\.')  # split on every literal period
splitted_text = pattern.split(some_text)
print(splitted_text)
This returns:
['Hi there', " It's a nice weather", ' Have a great day', '']
What I want it to return is:
['Hi there.', " It's a nice weather.", ' Have a great day.']
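By the way: I'm only interested in a solution using re, not in something a library such as nltk does by other means.
Best answer
You can use a lookbehind to split on the whitespace that follows a period. In addition, to handle the case where there is no whitespace after the period, you can use a lookahead: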
import re
some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."
# Raw string so that \. and \s reach the regex engine unchanged
result = re.split(r'(?<=\.)\s|\.(?=[A-Z])', some_text)
print(result)
Output:
['Hi there.', "It's a nice weather.", 'Have a great day', 'It is a beautify day.']
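Note that 'Have a great day' loses its period in the output above, because the second alternative matches the period itself. A possible variation, sketched here under the assumption of Python 3.7 or later (where re.split can split on empty matches), uses only lookarounds plus optional whitespace, so no period is ever consumed. The names SENTENCE_BREAK and split_keep_period are hypothetical and not part of the original answer.

```python
import re

# Hypothetical variation on the answer's pattern (not the answer's own):
# split at a position preceded by a period and followed, after optional
# whitespace, by an uppercase letter. Only the whitespace is consumed.
SENTENCE_BREAK = re.compile(r'(?<=\.)\s*(?=[A-Z])')

def split_keep_period(text):
    """Split text after each period that precedes an uppercase letter."""
    return SENTENCE_BREAK.split(text)

some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."
print(split_keep_period(some_text))
# ['Hi there.', "It's a nice weather.", 'Have a great day.', 'It is a beautify day.']
```

Like the original pattern, this sketch only splits before an uppercase letter, so it would also split after abbreviations such as "Dr." and would not split before a sentence that starts with a digit or a lowercase letter.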
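Explanation:
(?<=\.) => positive lookbehind: a . must come immediately before this position for the next part of the pattern to match.
\s => matches a whitespace character (" ").
| => alternation: the regex tries the expression on its left and then the one on its right, and uses whichever matches first.
\. => matches a literal period. Because this period is part of the match, it is removed from the result, which is why 'Have a great day' has no trailing period.
(?=[A-Z]) => positive lookahead: the preceding period matches only if the next character is an uppercase letter.
Source: Stack Overflow, "python - Split text, but include the pattern in the first split part": https://stackoverflow.com/questions/58467573/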