How to parse HTML pages with Node.js

Question

I need to parse (server side) a large number of HTML pages.
We all agree that regexp is not the way to go here.
It seems to me that JavaScript is the native way of parsing an HTML page, but that assumption relies on the server-side code having all the DOM abilities JavaScript has inside a browser.

Does Node.js have that ability built in?
Is there a better approach to this problem, parsing HTML on the server side?

Recommended answer

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.
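As a rough illustration of the jsdom route, the sketch below builds a DOM from an HTML string and reads a few values out of it. It assumes jsdom has been installed with `npm install jsdom`; the helper name `summarizePage` and the choice of fields are made up for this example and are not part of the original answer.

```javascript
// Minimal sketch, assuming jsdom is installed (npm install jsdom).
// summarizePage is a hypothetical helper name used only for illustration.
function summarizePage(html) {
  // Required lazily so this file still loads where jsdom is not installed.
  const { JSDOM } = require('jsdom');

  // Build a DOM from the HTML string. By default jsdom does not execute
  // scripts found in the page.
  const { document } = new JSDOM(html).window;

  // From here on, the usual browser DOM API is available.
  const titleEl = document.querySelector('title');
  const headings = Array.from(document.querySelectorAll('h1, h2'))
    .map((el) => el.textContent.trim());

  return {
    title: titleEl ? titleEl.textContent.trim() : null,
    headings,
  };
}

// Example call:
// summarizePage('<title>Demo</title><h1>Hello</h1><p>Some text</p>');
```

For a large batch of pages, a function like this could be called once per document, although memory use and throughput would need to be measured for the actual workload.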

Other options include:

  • BeautifulSoup for Python
  • Converting your HTML to XHTML and using XSLT
  • HTMLAgilityPack for .NET
  • CsQuery for .NET (my new favorite)
  • The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Out of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML to write XSLT is just plain sadistic.
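To make the code-reuse point concrete, here is a small sketch of a function written only against standard DOM accessor methods (`getElementsByTagName`, `getAttribute`, `textContent`). It takes a `document` argument, so the same code could in principle receive the browser's `document` or one produced by a server-side library such as jsdom. The name `extractLinks` and the export guard at the bottom are assumptions made for this example.

```javascript
// Minimal sketch: a helper that depends only on standard DOM accessors,
// so it does not care whether the document came from a browser or from
// a server-side DOM implementation. extractLinks is a hypothetical name.
function extractLinks(document) {
  return Array.from(document.getElementsByTagName('a'))
    .map((a) => ({
      text: a.textContent.trim(),
      href: a.getAttribute('href'),
    }))
    // Anchors without an href attribute are skipped.
    .filter((link) => link.href !== null);
}

// Export for Node.js when a CommonJS module object exists; in a browser
// script tag the function is simply available as a global.
if (typeof module !== 'undefined' && module.exports) {
  module.exports = { extractLinks };
}

// Browser:  extractLinks(document)
// Node.js:  extractLinks(new (require('jsdom').JSDOM)(html).window.document)
//           (assumes jsdom is installed)
```

Passing the document in as a parameter, rather than reading a global, is the design choice that keeps the helper portable between the two environments.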
