Problem description
I'm downloading HTML from a website. The file can be quite large, so while it is downloading I want to start parsing the chunks of HTML that have already arrived, so that the process appears faster to the end user of my program. I don't have control over how the chunks are generated, so a chunk can begin in the middle of a word, e.g. like so:
chunk 1 ---> <div class="storyti
chunk 2 ---> tle"><a href="htt
chunk 3 ---> p://www.xkcd.com/">XKCD</a>
...and so on.
I have seen an example where libxml2 was used to parse XML chunks exactly as I described. Can libxml2 also parse HTML chunks? I ran tidy on the HTML files I'm going to download; it reports warnings but no errors. Can libxml2 parse those HTML chunks as well?
Recommended answer
libxml2 has an HTML parser that tolerates malformed/broken HTML. Please check the link here.