Problem description
As stated in the title, I need to return all of the content within the body tags of an HTML document, including any nested HTML tags. I'm curious what the best way to go about this is. I had a working solution with the Gokogiri package, but I am trying to stay away from any packages that depend on C libraries. Is there a way to accomplish this with the Go standard library, or with a package that is 100% Go?
Since posting my original question, I have tried the following packages, and neither solved the problem:
- pkg/encoding/xml/ (standard library xml package)
- golang.org/x/net/html
Neither seems to return the children or nested tags from inside the body. For example, given
<!DOCTYPE html>
<html>
<head>
<title>
Title of the document
</title>
</head>
<body>
body content
<p>more content</p>
</body>
</html>
both return only "body content", ignoring the <p> tag and the text it wraps.
The overall goal is to obtain a string that looks like:
<body>
body content
<p>more content</p>
</body>
This can be solved by recursively finding the body node with the html package and then rendering the HTML, starting from that node.
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// getBody walks the node tree depth-first and returns the <body> element.
func getBody(doc *html.Node) (*html.Node, error) {
	var b *html.Node
	var f func(*html.Node)
	f = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "body" {
			b = n
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}
	f(doc)
	if b != nil {
		return b, nil
	}
	return nil, errors.New("missing <body> in the node tree")
}

// renderNode serializes n and all of its descendants back to HTML.
func renderNode(n *html.Node) string {
	var buf bytes.Buffer
	html.Render(&buf, n)
	return buf.String()
}

func main() {
	doc, err := html.Parse(strings.NewReader(htm))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	bn, err := getBody(doc)
	if err != nil {
		fmt.Println(err)
		return
	}
	body := renderNode(bn)
	fmt.Println(body)
}

const htm = `<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
body content
<p>more content</p>
</body>
</html>`