Problem description
I am working on Twitter data normalization. Twitter users frequently use terms like "I looooooove it" in order to emphasize the word love. I want to reduce such repeated characters to a proper English word by removing repeated characters until I get a meaningful word (I am aware that I cannot differentiate between good and god by this mechanism).
My strategy is:
1. Identify the existence of such repeated strings. I look for more than two identical consecutive characters, as there is probably no English word with more than two repeated characters.
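import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatDetector {
    public static void main(String[] args) {
        String[] strings = { "stoooooopppppppppppppppppp", "looooooove", "good", "OK", "boolean", "mee", "claaap" };
        // A lowercase letter followed by two or more copies of itself (three or more in a row)
        String regex = "([a-z])\\1{2,}";
        Pattern pattern = Pattern.compile(regex);
        for (String string : strings) {
            Matcher matcher = pattern.matcher(string);
            if (matcher.find()) {
                System.out.println(string + " TRUE ");
            }
        }
    }
}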
2. Search for such words in a lexicon like WordNet.
Due to my poor Java knowledge I am unable to manage steps 3 and 4. The problem is that I cannot replace all but two of the repeated consecutive characters. The following code snippet replaces all but one of the repeated characters:
System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
I need help with the following:
A. How to replace all but two consecutive repeated characters.
B. How to remove one more consecutive character from the output of A. [I think B can be managed by the following code snippet.]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: The solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in Python. Python uses re.sub.
Recommended answer
Your regex "([a-z])\\1{2,}" matches and captures a lowercase ASCII letter into Group 1 and then matches 2 or more occurrences of this value. So, all you need is to replace each match with a backreference, $1, that holds the captured value. If you use one $1, aaaaa will be replaced with a single a, and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
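To make the two replacements concrete, here is a minimal sketch that applies them to words from the question and then falls back from the two-letter form to the one-letter form. The class name RepeatNormalizer, its helper methods, and the tiny LEXICON set are all hypothetical; the set only stands in for a real lexicon such as WordNet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RepeatNormalizer {
    // Pattern from the question: a lowercase ASCII letter followed by
    // two or more copies of itself (three or more in a row).
    static final String REGEX = "([a-z])\\1{2,}";

    // Hypothetical stand-in for a real lexicon such as WordNet.
    static final Set<String> LEXICON =
            new HashSet<>(Arrays.asList("stop", "love", "good", "god", "clap"));

    // Collapse every run of three or more identical letters to two.
    static String keepTwo(String data) {
        return data.replaceAll(REGEX, "$1$1");
    }

    // Collapse every run of three or more identical letters to one.
    static String keepOne(String data) {
        return data.replaceAll(REGEX, "$1");
    }

    // Try the two-letter form first, then the one-letter form;
    // return the input unchanged if neither is in the lexicon.
    static String normalize(String word) {
        String two = keepTwo(word);
        if (LEXICON.contains(two)) return two;
        String one = keepOne(word);
        if (LEXICON.contains(one)) return one;
        return word;
    }

    public static void main(String[] args) {
        System.out.println(keepTwo("looooooove"));   // loove
        System.out.println(keepOne("looooooove"));   // love
        System.out.println(normalize("looooooove")); // love
        System.out.println(normalize("goooood"));    // good
        System.out.println(normalize("claaap"));     // clap
        System.out.println(normalize("mee"));        // mee
    }
}
```

Note that this sketch collapses all runs in a word the same way, so a word like feeeelll yields feell or fel but never feel; handling each run separately would need a different approach.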
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
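The following sketch compares these variants on a few inputs. The class name RepeatVariants, its constants, and the squeeze helper are hypothetical names chosen for illustration. The non-ASCII input is written with Unicode escapes (\u00e9 is é) so the source stays ASCII-only.

```java
class RepeatVariants {
    static final String ASCII_LOWER = "([a-z])\\1{2,}";
    static final String CASE_INSENSITIVE = "(?i)([a-z])\\1{2,}";
    static final String POSIX_ALPHA = "(\\p{Alpha})\\1{2,}";
    static final String UNICODE_LETTER = "(\\p{L})\\1{2,}";

    // Collapse runs of three or more repeated letters to two,
    // using whichever pattern is passed in.
    static String squeeze(String data, String regex) {
        return data.replaceAll(regex, "$1$1");
    }

    public static void main(String[] args) {
        // The lowercase-only pattern ignores uppercase runs.
        System.out.println(squeeze("LOOOOVE", ASCII_LOWER));      // LOOOOVE
        System.out.println(squeeze("LOOOOVE", CASE_INSENSITIVE)); // LOOVE
        System.out.println(squeeze("LOOOOVE", POSIX_ALPHA));      // LOOVE

        // \p{Alpha} accepts both cases, but without (?i) the backreference
        // is still case sensitive, so a mixed-case run is left alone.
        System.out.println(squeeze("LoOoOve", POSIX_ALPHA));      // LoOoOve
        System.out.println(squeeze("LoOoOve", CASE_INSENSITIVE)); // Loove

        // Only \p{L} handles letters outside ASCII by default.
        String cafe = "caf\u00e9\u00e9\u00e9\u00e9";
        System.out.println(squeeze(cafe, POSIX_ALPHA).equals(cafe));                  // true
        System.out.println(squeeze(cafe, UNICODE_LETTER).equals("caf\u00e9\u00e9")); // true
    }
}
```

The mixed-case lines show a difference between the first two options: \p{Alpha} matches letters of either case, but only (?i) lets the backreference match a repeat in the other case.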