HtmlAgilityPack and Large HTML Documents


Problem Description

I have built a little crawler, and when trying it out I found that crawling certain sites drives its CPU usage to 98-99%.

I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.

I then went to see which URLs were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :) So, now I am 99% certain it has to do with the following piece of code:

HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;

// Parse the already-downloaded HTML, then select every <a> that has an href.
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");

All I want to do is extract the links on the page, so for large sites... is there any way I can get this to not use so much CPU?

I was thinking of maybe limiting what I fetch? What would be my best option here?

Surely someone out there has run into this :)
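For reference, one way to read "limit what I fetch" is to cap how many characters are taken from the response before parsing. This is only a hypothetical sketch, not the asker's code: the URL and the 512 KB cap are placeholder values, and `_html` mirrors the field used in the question.

using System.IO;
using System.Net;

// Hypothetical: read at most maxChars of the body, then hand it to the parser.
const int maxChars = 512 * 1024; // assumed cap, tune as needed
var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    var buffer = new char[maxChars];
    int total = 0, read;
    // Network streams return data in chunks, so loop until the cap or EOF.
    while (total < maxChars &&
           (read = reader.Read(buffer, total, maxChars - total)) > 0)
    {
        total += read;
    }
    string _html = new string(buffer, 0, total);
}

Truncating mid-document can cut an element in half, but HtmlAgilityPack is tolerant of malformed input, so the link extraction would still work on the portion that was fetched.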

Recommended Answer

".//a[@href]" is an extremely slow XPath. Try replacing it with "//a[@href]", or with code that simply walks the whole document and checks all A nodes.
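As a minimal sketch of the two alternatives the answer suggests (assuming, as in the question, that `_html` already holds the downloaded page; the LINQ filtering needs `using System.Linq;`):

// Option 1: root-anchored XPath - "//a[@href]" is evaluated once over the
// whole tree instead of ".//a[@href]" relative to a context node.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(_html);
HtmlAgilityPack.HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
// Note: SelectNodes returns null when nothing matches.

// Option 2: skip XPath and walk the node tree once with Descendants.
var hrefs = doc.DocumentNode
    .Descendants("a")
    .Where(a => a.Attributes["href"] != null)
    .Select(a => a.GetAttributeValue("href", string.Empty))
    .ToList();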

Why this XPath is slow:

  1. ".从节点开始
  2. "//"选择所有后代节点
  3. "a"-仅选择"a"个节点
  4. 带有href的"@href".

Portions 1 and 2 together amount to "for every node, select all of its descendant nodes", which is very slow.
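To check the claim on a concrete page, here is a quick micro-benchmark sketch; the file name is a placeholder, and actual timings will vary with document size:

using System;
using System.Diagnostics;
using HtmlAgilityPack;

class XPathTiming
{
    static void Main()
    {
        // Load a large sample page (placeholder path - substitute your own).
        var doc = new HtmlDocument();
        doc.Load("large-page.html");

        // Relative form: descendants are re-expanded per context node.
        var sw = Stopwatch.StartNew();
        var slow = doc.DocumentNode.SelectNodes(".//a[@href]");
        sw.Stop();
        Console.WriteLine("relative .//a[@href]: " + sw.ElapsedMilliseconds + " ms");

        // Rooted form: expanded once from the document root.
        sw.Restart();
        var fast = doc.DocumentNode.SelectNodes("//a[@href]");
        sw.Stop();
        Console.WriteLine("rooted //a[@href]: " + sw.ElapsedMilliseconds + " ms");
    }
}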
