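How can I add the tag NEG_ to every word that follows not, no, or never, up to the next punctuation mark in the string (for sentiment analysis)? I assume a regular expression could do this, but I'm not sure how to write it.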
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Expected output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
Best answer
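To make up for Python's regex engine lacking some Perl features, you can pass a lambda expression to the re.sub function to build the replacement dynamically: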
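import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"

# Outer pattern: a negation keyword, then words and whitespace, then one punctuation character.
# Inner substitution: prefix every whitespace-preceded word inside that match with NEG_.
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)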
This prints:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
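As a quick check, the same substitution can be run on the exact input from the question and compared with the expected output given there. This is only a sketch: the function name neg_tag and the variable names are illustrative and not part of the original answer.

```python
import re

def neg_tag(text):
    # Hypothetical helper name; the body is the substitution from the answer.
    return re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                  lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                  text,
                  flags=re.IGNORECASE)

# Input and expected output copied from the question.
question_input = ("It was never going to work, he thought. "
                  "He did not play so well, so he had to practice some more.")
expected = ("It was never NEG_going NEG_to NEG_work, he thought. "
            "He did not NEG_play NEG_so NEG_well, so he had to practice some more.")

result = neg_tag(question_input)
print(result == expected)
```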
Explanation
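The first step is to select the parts you are interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
That is, your negation keyword (\b is a word boundary, (?:...) is a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_] in ASCII mode, and for Python 3 str patterns it also matches Unicode word characters by default; \s is any kind of whitespace), up to a character that is neither alphanumeric nor whitespace (acting as punctuation). Note that the punctuation is mandatory here, but you can safely remove [^\w\s] to also match up to the end of the string.
Now you are dealing with strings such as "never going to work,". Just select the words preceded by whitespace with (\s+)(\w+) and replace them with what you want: \1NEG_\2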
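The two steps, together with the end-of-string variant mentioned in the explanation, can be sketched separately as follows. The names NEGATION_SPAN and tag_span and the sample sentence are assumptions made for illustration; the outer pattern here is the answer's pattern with the trailing [^\w\s] removed, so a negation that runs to the end of the string is selected too.

```python
import re

# Outer pattern without the mandatory punctuation: a negation keyword
# followed by a run of word characters and whitespace.
NEGATION_SPAN = re.compile(r'\b(?:not|never|no)\b[\w\s]+', re.IGNORECASE)

def tag_span(match):
    # Inner step: prefix each word that follows whitespace with NEG_.
    # The keyword itself starts the match, so it has no preceding
    # whitespace inside the match and is left untagged.
    return re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0))

# Illustrative sentence whose second negation has no closing punctuation.
text = "He did not play so well, and it was never going to work"

# Step 1: inspect which spans the outer pattern selects.
spans = NEGATION_SPAN.findall(text)
print(spans)

# Step 2: rewrite only those spans.
tagged = NEGATION_SPAN.sub(tag_span, text)
print(tagged)
```

Splitting the lambda into a named function makes it easier to inspect the selected spans before tagging them, which helps when adjusting the keyword list or the definition of punctuation.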