Problem description
I have an HTML string like this:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
I wish to strip all HTML tags so that the resulting string becomes:
foo bar baz
From another post here at SO I've come up with this function (which uses the Html Agility Pack):
Public Shared Function stripTags(ByVal html As String) As String
    Dim plain As String = String.Empty
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    htmldoc.LoadHtml(html)
    Dim invalidNodes As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//html|//body|//p|//a")
    If Not htmldoc Is Nothing Then
        For Each node In invalidNodes
            node.ParentNode.RemoveChild(node, True)
        Next
    End If
    Return htmldoc.DocumentNode.WriteContentTo
End Function
Unfortunately, this does not return what I expect. Instead it gives:
bazbarfoo
Please, where do I go wrong - and is this the best approach?
Regards and happy coding!
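UPDATE: based on the answer below I came up with this function, which might be useful to others:
Public Shared Function stripTags(ByVal html As String) As String
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    ' Add a blank line after each paragraph and a line break for each <br/>,
    ' so the plain text keeps some of the original layout.
    ' Environment.NewLine is a String, so it is concatenated twice here
    ' (the String(Char, Integer) constructor expects a single Char).
    htmldoc.LoadHtml(html.Replace("</p>", "</p>" & Environment.NewLine & Environment.NewLine).Replace("<br/>", Environment.NewLine))
    Return htmldoc.DocumentNode.InnerText
End Function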
Why not just return htmldoc.DocumentNode.InnerText instead of removing all the non-text nodes? It should give you what you want.
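A minimal sketch of that suggestion, assuming the HtmlAgilityPack package is referenced by the project. The module name TagStripper and the Main routine are only there for illustration and are not part of the original question:

```vb
Imports System
Imports HtmlAgilityPack

Public Module TagStripper
    ' Parses the markup and returns the concatenated text of the document,
    ' without removing any nodes by hand.
    Public Function StripTags(ByVal html As String) As String
        Dim htmldoc As New HtmlDocument()
        htmldoc.LoadHtml(html)
        Return htmldoc.DocumentNode.InnerText
    End Function

    Public Sub Main()
        ' Sample input taken from the question.
        Dim input As String = "<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>"
        Console.WriteLine(StripTags(input))
    End Sub
End Module
```

For the sample string from the question, this should yield the text in document order (foo, bar, baz), because no nodes are moved or removed. Note that InnerText leaves character entities such as &amp;amp; encoded, so a decoding step may still be needed depending on the input.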