Question
Can I parse an HTML file using an XML parser?
Why can (or can't) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical.
The intended use is to build an HTML parser that is part of a web crawler application.
Solution
You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can use the following HTML features, which XML parsers don’t understand.
- elements that never have end tags and that don’t use XML’s so-called "self-closing tag syntax"; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
- elements that don’t require end tags; e.g., <p> <dt> <li> (their end tags can be implied)
- elements that can contain unescaped markup "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>
- attributes with unquoted values; e.g., <meta charset=utf-8>
- attributes that are empty, with no separate value given at all; e.g., <input disabled>
An XML parser will fail to parse any HTML document that uses any of those features.
An HTML parser, on the other hand, will basically never fail no matter what a document contains.
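The contrast can be seen in a small Python sketch. It uses only standard-library modules; the sample document and the names TagCollector, xml_parse_succeeds, and html_start_tags are illustrative assumptions, not part of the original answer. Note that Python's html.parser is lenient but is not claimed here to implement the HTML standard's parsing algorithm; it only stands in for "an HTML parser" so the example needs no third-party packages.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: valid HTML that uses void elements, an unquoted
# attribute value, an empty attribute, and implied end tags.
HTML_DOC = """<!DOCTYPE html>
<html>
<head><meta charset=utf-8><title>Demo</title></head>
<body>
<p>First paragraph<br>
<p>Second paragraph
<input disabled>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Records the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_parse_succeeds(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def html_start_tags(text):
    """Return the start tags seen by the lenient HTML parser, in order."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("XML parser accepted it:", xml_parse_succeeds(HTML_DOC))
    print("HTML parser saw:", html_start_tags(HTML_DOC))
```

The XML parser rejects the document as soon as it reaches the unquoted attribute value, while the HTML parser reports every element without raising an error.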
All that said, there has also been work toward a new type of XML parsing, so-called XML5 parsing, capable of handling things like empty or unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.
If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.
These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:
- parse5 (Node.js/JavaScript)
- html5lib (Python)
- html5ever (Rust)
- validator.nu HTML5 parser (Java)
- gumbo (C, with bindings for Ruby, Objective-C, C++, PHP, C#, Perl, Lua, D, Julia…)
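For the crawler use case, the core task is usually pulling links out of fetched pages. Below is a minimal sketch of that step, assuming Python's standard-library html.parser as a stand-in so the example runs without extra packages; for real crawling, a conformant parser such as those listed above would be the better choice. The names LinkExtractor and extract_links and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip empty attributes such as a bare "href" with no value.
            if name == "href" and value:
                # Resolve relative links against the page's URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return all link targets found in html_text, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Hypothetical page: unquoted attribute value, void <br>, unclosed <p>.
    page = '<p>See <a href=about.html>about</a><br><a href="https://example.org/x">x</a>'
    for link in extract_links(page, "https://example.com/index.html"):
        print(link)
```

The sample page uses an unquoted attribute value and a void element, both of which would stop an XML parser before any link was found.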