java - Regex to remove specific HTML tags

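I am trying to write a regex in Java that removes everything starting with <select and ending with />, as shown below. I wrote the regex below, which replaces everything from the <start tag onward with an empty string. It removes everything as expected except for the fourth line, <select name="first" ... the popular: it removes the content of that line but leaves the content of the next line, and ... president"/>, in place. I want the match to cover everything from the opening tag through the closing tag. How can I do this?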

str.replaceAll(".*<start.*", "");

The actual content of String str looks like this:

<select name="id" content="2454803.html"/>
<select name="nameid" content="2454803"/>
<select name="type" content="prd"/>
<select name="first" content="In 2004, Charlie, the popular
and charismatic senator , became the first president"/>
<select name="title" content="Charlie"/>
<h1>
<!--toc:insert content="checkbox" id="_1_0"/>-->
</h1>
<p class="tocline"><a href="2454803">Table of Contents</a></p>

Best answer

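According to the Java documentation for Pattern (Pattern.html#lt):

  The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.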

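A line terminator is any of the following:


  a newline (line feed) character ('\n'),
  a carriage-return character followed immediately by a newline character ("\r\n"),
  a standalone carriage-return character ('\r'),
  a next-line character ('\u0085'),
  a line-separator character ('\u2028'), or
  a paragraph-separator character ('\u2029').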
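
The rule quoted above can be checked with a small sketch. The class name DotAllDemo, the helper dotMatches, and the test pattern a.b are illustrative assumptions, not part of the original answer; the sketch only shows that . skips line terminators unless Pattern.DOTALL is set.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares "." with and without the DOTALL flag.
class DotAllDemo {

    // Returns true if the whole input matches the pattern "a.b".
    // When dotAll is true the pattern is compiled with Pattern.DOTALL.
    static boolean dotMatches(String input, boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile("a.b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Three inputs with a line terminator in the middle, one without.
        String[] samples = {"a\nb", "a\rb", "a\u2028b", "a-b"};
        for (String s : samples) {
            System.out.println("default=" + dotMatches(s, false)
                    + " dotall=" + dotMatches(s, true));
        }
    }
}
```

With the default flags, only the last sample should match, because the other three have a line terminator where the . is expected; with DOTALL all four should match.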


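The simplest way to specify the DOTALL flag is to add (?s) at the start of the regex. A few other changes are needed to accommodate the flag: the quantifier becomes lazy (.*?) so that each match stops at the first >, and \r?\n? consumes the line break that follows the tag. The final regex is therefore (?s)<select.*?>\r?\n?, used like this: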

str = str.replaceAll("(?s)<select.*?>\\r?\\n?", "");

Demo here: http://regex101.com/r/bW8aR7
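
For readers who want to try the DOTALL version locally, here is a minimal sketch run against part of the sample input from the question. The class name RemoveSelectDotAll and the helper stripSelectTags are hypothetical names chosen for illustration; only the regex comes from the answer.

```java
// Hypothetical wrapper around the DOTALL regex from the answer.
class RemoveSelectDotAll {

    // Removes every <select ...> tag plus the line break that follows it.
    // Strings are immutable, so the result is returned rather than
    // modified in place.
    static String stripSelectTags(String html) {
        return html.replaceAll("(?s)<select.*?>\\r?\\n?", "");
    }

    public static void main(String[] args) {
        // Shortened copy of the question's input, including the tag
        // whose content attribute spans two lines.
        String str = "<select name=\"type\" content=\"prd\"/>\n"
                + "<select name=\"first\" content=\"In 2004, Charlie, the popular\n"
                + "and charismatic senator , became the first president\"/>\n"
                + "<select name=\"title\" content=\"Charlie\"/>\n"
                + "<h1>\n";
        System.out.print(stripSelectTags(str));
    }
}
```

On this input only the <h1> line should remain, since the lazy .*? now runs across the line break inside the content attribute up to the closing />.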

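Alternatively, you can use the regex <select[^>]*>\r?\n?, which needs no flag because a negated character class such as [^>] also matches line terminators: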

str = str.replaceAll("<select[^>]*>\\r?\\n?", "");

Demo here: http://regex101.com/r/lO6mQ6
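
The negated-character-class version can be sketched the same way. The class name RemoveSelectNegatedClass, the constant SELECT_TAG, and the helper stripSelectTags are assumptions made for this example; precompiling the pattern is a design choice of the sketch, not something the answer prescribes.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the negated-character-class regex.
class RemoveSelectNegatedClass {

    // Compiled once and reused; [^>] matches any character except '>',
    // line terminators included, so no DOTALL flag is needed.
    private static final Pattern SELECT_TAG =
            Pattern.compile("<select[^>]*>\\r?\\n?");

    // Returns a copy of the input with every <select ...> tag and the
    // line break after it removed.
    static String stripSelectTags(String html) {
        return SELECT_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Windows-style line endings, with a tag spanning two lines.
        String str = "<select name=\"first\" content=\"In 2004, Charlie, the popular\r\n"
                + "and charismatic senator , became the first president\"/>\r\n"
                + "<h1>\r\n";
        System.out.print(stripSelectTags(str));
    }
}
```

One caveat applies to both patterns: they stop at the first > after <select, so an attribute value that itself contains > would end the match early. For input like that, an HTML parser is likely a safer tool than a regex.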

This question about java - regex to remove specific HTML tags is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/22236709/