java - Hard time with escaped characters

I need to strip some illegal characters from a string, and wrote the following piece of code in my StringUtil library:

public static String removeBlockedCharacters(String data) {
    if (data==null) {
      return data;
    }
    return data.replaceAll("(?i)[<|>|\u003C|\u003E]", "");
}

I have a test file, illegalCharacters.txt, that contains a single line:

hello \u003c here < and > there

I run the following unit test:

@Test
public void testBlockedCharactersRemoval() throws IOException{
    checkEquals(StringUtil.removeBlockedCharacters("a < b > c\u003e\u003E\u003c\u003C"), "a  b  c");
    log.info("Procesing from string directly: " + StringUtil.removeBlockedCharacters("hello \u003c here < and > there"));
    log.info("Procesing from file to string:  " + StringUtil.removeBlockedCharacters(FileUtils.readFileToString(new File("src/test/resources/illegalCharacters.txt"))));
}

And I get:

INFO - 2010-09-14 13:37:36,111 - TestStringUtil.testBlockedCharactersRemoval(36) | Procesing from string directly: hello  here  and  there
INFO - 2010-09-14 13:37:36,126 - TestStringUtil.testBlockedCharactersRemoval(37) | Procesing from file to string:  hello \u003c here  and  there

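I'm puzzled: as you can see, the code correctly strips "<", ">" and "\u003c" when I pass in a string literal containing these values, but it fails to strip "\u003c" when I read the same string from a file.

My questions, so that I can stop losing hair over this:

Why does this happen?
How can I change my code so that it strips \u003c correctly in all cases?

Thanks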

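Best answer

When a source file is compiled, the very first thing that happens, before any lexical or syntactic analysis, is that the Unicode escapes \u003C and \u003E are converted to the actual characters < and >. So your code is really: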

return data.replaceAll("(?i)[<|>|<|>]", "");

The same thing happens to the string literals when your test code is compiled. The test string you wrote as:

"a < b > c\u003e\u003E\u003c\u003C"

...is really:

"a < b > c>><<"

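But when you read the test string from a file, no such conversion takes place. You end up trying to match the six-character sequence \u003c against the single character <. If you really want to match \u003C and \u003E, your code should look like this: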

return data.replaceAll("(?i)(?:<|>|\\\\u003C|\\\\u003E)", "");
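To see the difference between a compiled literal and file contents, here is a minimal sketch. The class name FileEscapeDemo and the temporary file are assumptions made for illustration; only removeBlockedCharacters and its regex come from the code above. The doubled backslash in the sample line stands in for what a text editor would save to disk.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileEscapeDemo {
    // The method from the question, with the four-backslash regex applied.
    static String removeBlockedCharacters(String data) {
        if (data == null) {
            return null;
        }
        return data.replaceAll("(?i)(?:<|>|\\\\u003C|\\\\u003E)", "");
    }

    public static void main(String[] args) throws IOException {
        // A single-backslash escape is translated by the compiler: one character.
        System.out.println("\u003c".length());
        // A doubled backslash yields a real backslash: six characters.
        System.out.println("\\u003c".length());

        // Hypothetical stand-in for illegalCharacters.txt, written to a temp file.
        String line = "hello \\u003c here < and > there";
        Path tmp = Files.createTempFile("illegalCharacters", ".txt");
        try {
            Files.write(tmp, line.getBytes(StandardCharsets.UTF_8));
            String fromFile = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            // Both the real angle brackets and the six-character sequence are removed.
            System.out.println(removeBlockedCharacters(fromFile));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With this regex the text read from the file should come out the same as the text passed in as a literal.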

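If you use one backslash, the Java compiler interprets it as a Unicode escape and converts it to < or >.
If you use two backslashes, the regex compiler interprets it as a Unicode escape and assumes you want to match < or >.
If you use three backslashes, the Java compiler converts it to \< or \>, and the regex compiler ignores the backslash and tries to match < or >.
So, to match a raw Unicode escape sequence, you have to use four backslashes to match the single backslash in the escape sequence.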
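The four cases above can be checked with a small sketch. The class name BackslashDemo, the strip helper and the sample input are made up for illustration; the input contains both a real < and the six-character escape sequence, so each regex shows which of the two it removes.

```java
class BackslashDemo {
    // Sample text as it would arrive from a file: a real '<' plus a
    // literal backslash followed by "u003c" (hence the doubled backslash).
    static final String INPUT = "a < b \\u003c c";

    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // One backslash: the compiler replaces the escape with '<',
        // so only the real '<' is removed.
        System.out.println(strip("\u003C", INPUT));

        // Two backslashes: the regex engine resolves the escape to '<',
        // so again only the real '<' is removed.
        System.out.println(strip("\\u003C", INPUT));

        // Three backslashes: the compiler leaves an escaped '<' for the
        // regex engine, which still matches only the real '<'.
        System.out.println(strip("\\\u003C", INPUT));

        // Four backslashes: the regex matches a literal backslash followed
        // by "u003C", so the six-character sequence is removed instead.
        System.out.println(strip("(?i)\\\\u003C", INPUT));
    }
}
```

Only the last call touches the escape sequence itself, which is why the corrected regex needs both the plain < and > alternatives and the four-backslash ones.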

Note that I also changed your brackets. [<|>] is a character class that matches <, | or >; what you want is alternation.
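The difference between the two constructs shows up as soon as the input contains a pipe. The following is a minimal sketch; the class and method names are hypothetical and the sample input is invented for the comparison.

```java
class ClassVsAlternation {
    // Character class: '|' is just another member, so it is stripped too.
    static String withCharClass(String s) {
        return s.replaceAll("[<|>]", "");
    }

    // Alternation in a non-capturing group: '|' separates the alternatives
    // and is not matched itself.
    static String withAlternation(String s) {
        return s.replaceAll("(?:<|>)", "");
    }

    public static void main(String[] args) {
        String input = "a < b | c > d";
        System.out.println(withCharClass(input));   // the pipe is removed as well
        System.out.println(withAlternation(input)); // the pipe survives
    }
}
```

For single characters only, a class without the pipes, [<>], would behave the same as the alternation; the alternation is needed here because the escape sequences are multi-character alternatives.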