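I have installed FCKeditor, and pasting from MS Word adds a lot of unnecessary formatting. I want to keep certain things, such as bold, italics, and bullets. I searched the web and found a solution, but it strips everything, including the things I want to keep, like bold and italics. Is there a way to strip only the unnecessary Word formatting?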
Best answer
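Here is the solution I use to scrub incoming HTML from rich text editors. It is written in VB.NET and I don't have time to convert it to C#, but it's pretty straightforward: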
'' Requires "Imports System.Text.RegularExpressions" at the top of the file
Public Shared Function CleanHtml(ByVal html As String) As String
    '' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
    '' Only returns acceptable HTML, and converts line breaks to <br />
    '' Acceptable HTML includes HTML-encoded entities.
    html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting
    '' Does this have HTML tags?
    If html.IndexOf("<") >= 0 Then
        '' Make all tags lowercase
        html = Regex.Replace(html, "<[^>]+>", AddressOf LowerTag)
        '' Filter out anything except allowed tags
        '' Problem: this strips attributes, including href from a
        '' http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
        Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
        '' The \b keeps longer names such as iframe or body from matching the allowed prefixes i and b
        Dim WhiteListPattern As String = "</?(?(?=(?:" & AcceptableTags & ")\b)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
        html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled)
        '' Make all BR/br tags look the same, and trim them of whitespace before/after
        html = Regex.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled)
    End If
    '' No CRs
    html = html.Replace(ControlChars.Cr, "")
    '' Convert remaining LFs to line breaks
    html = html.Replace(ControlChars.Lf, "<br />")
    '' Trim BRs at the end of any string, and spaces on either side
    Return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim()
End Function

Public Shared Function LowerTag(ByVal m As Match) As String
    Return m.ToString().ToLower()
End Function
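In your case, you would modify the list of "approved" HTML tags in AcceptableTags. The code will still strip all the useless attributes (and, unfortunately, the useful ones such as HREF and SRC; hopefully those aren't important to you).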
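Of course, this requires a trip to the server. If you don't want that, you'll need to add some sort of "clean up" button to the toolbar that calls JavaScript to rework the editor's current text. Unfortunately, "paste" is not an event that can be trapped to clean the markup automatically, and cleaning after every OnChange would make the editor unusable (because changing the markup changes the position of the text cursor).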
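For the client-side "clean up" button, something along these lines might be a starting point. This is only a rough sketch, not FCKeditor code: the name cleanPastedHtml and the ALLOWED_TAGS set are made up for illustration (the set simply mirrors AcceptableTags above), and it assumes attribute values never contain a ">" character. It only transforms a string; reading the HTML out of the editor and writing it back depends on the editor's own API and is left out.

```javascript
// Hypothetical whitelist, mirroring AcceptableTags from the VB.NET code above.
const ALLOWED_TAGS = new Set([
  "i", "b", "u", "sup", "sub", "ol", "ul", "li", "br",
  "h2", "h3", "h4", "h5", "span", "div", "p", "a", "img", "blockquote",
]);

// cleanPastedHtml is an illustrative name, not part of any editor API.
function cleanPastedHtml(html) {
  return html
    // Drop HTML comments, which covers Word's conditional comments.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Drop <style>, <script> and <xml> blocks together with their contents.
    .replace(/<(style|script|xml)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Keep whitelisted tags without their attributes; remove every other tag.
    // Assumption: no attribute value contains a ">" character.
    .replace(/<(\/?)([a-zA-Z][a-zA-Z0-9:]*)[^>]*>/g, (match, slash, name) => {
      const tag = name.toLowerCase();
      if (!ALLOWED_TAGS.has(tag)) return "";
      return tag === "br" ? "<br />" : `<${slash}${tag}>`;
    })
    // Turn non-breaking space entities into plain spaces, as the server code does.
    .replace(/&nbsp;/g, " ")
    .trim();
}
```

Like the server-side version, this drops HREF and SRC along with the Word-specific attributes. Keeping them would mean parsing attributes properly instead of discarding everything after the tag name.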