


I need to parse large numbers of HTML pages on the server side.
We all agree that regular expressions are not the way to go here.
It seems to me that JavaScript is the native way to parse an HTML page, but that assumption relies on the server-side code having all the DOM capabilities that JavaScript has inside a browser.


Does Node.js have that ability built in?
Is there a better approach to this problem, parsing HTML on the server side?


Node.js does not ship with a DOM, but you can use the npm modules jsdom and htmlparser to create and parse one.


Other options include:

  • BeautifulSoup for Python
  • Converting your HTML to XHTML and using XSLT
  • HTMLAgilityPack for .NET
  • CsQuery for .NET (my new favorite)
  • The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Of all these options, I prefer Node.js, because it uses the standard W3C DOM accessor methods and lets me reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.

