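I am writing a web application in ASP.NET and need help with regular expressions. I need two expressions: the first should find, and eventually replace with a single quote, every double quote character inside an HTML tag; the second should find every double quote character that is not inside an HTML tag and replace it with &quot;.
For example: <p>This is a "wonderful long text". "Another wonderful ong text"</p> At least it should be. Here we have a <a href="http://wwww.site-to-nowhere.com" target="_blank">link</a>
should be changed to: <p>This is a &quot;wonderful long text&quot;. &quot;Another wonderful ong text&quot;</p> At least it should be. Here we have a <a href='http://wwww.site-to-nowhere.com' target='_blank'>link</a>
I tried the following expression: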
"([^<>]*?)"(?=[^>]+?<)
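but the problem is that it fails to capture
"Another wonderful ong text"
probably because it sits right next to the </p> tag. Can you help me solve this? Or is there another way to handle this replacement in .NET?
Best answer
Don't use regex to parse HTML. I can recommend HtmlAgilityPack: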
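// Requires the HtmlAgilityPack package and "using System.IO;" for StringWriter.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is your HTML string

// Select every text node; attribute values are not text nodes, so they are left alone.
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlAgilityPack.HtmlTextNode node in textNodes)
    {
        node.Text = node.Text.Replace("\"", "&quot;");
    }
}

// Serialize the modified document back to a string.
StringWriter sw = new StringWriter();
doc.Save(sw);
string result = sw.ToString();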
I have tested this with your HTML sample, and this is the (desired) result:
<p>This is a &quot;wonderful long text&quot;. &quot;Another wonderful ong text&quot;</p> At least it should be. Here we have a <a href="http://wwww.site-to-nowhere.com" target="_blank">link</a>
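Note that this result still has double quotes inside the <a> tag, so it covers only the second of the two requested replacements. On why the original pattern missed the second string: its lookahead `[^>]+?<` needs at least one character between the closing quote and the next `<`, and there is none when `</p>` follows immediately. If a regex is acceptable for simple, well-formed input, both replacements can be done in one pass. The following is only a rough sketch, not a replacement for a real parser: the class name `QuoteRewriter` is made up for illustration, and it assumes that no attribute value contains `>` or an apostrophe and that the input has no comments, scripts, or stray `<` characters in text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteRewriter
{
    // Matches either a whole tag (from '<' up to the next '>') or a single
    // double quote. The scan runs left to right, so a tag is consumed whole
    // before any quote inside it could match on its own; a bare '"' match
    // therefore always lies outside a tag.
    private static readonly Regex TagOrQuote = new Regex("<[^>]*>|\"");

    public static string Rewrite(string html)
    {
        return TagOrQuote.Replace(html, m =>
            m.Value[0] == '<'
                ? m.Value.Replace('"', '\'') // inside a tag: " becomes '
                : "&quot;");                 // in text: " becomes &quot;
    }
}
```

Calling `QuoteRewriter.Rewrite(html)` on the sample from the question should give the output the question asks for, with `&quot;` in the text and single quotes around the `href` and `target` values.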