HtmlAgilityPack and Large HTML Documents


Problem Description

I have built a little crawler, and when trying it out I found that crawling certain sites drives its CPU usage to 98-99%.

I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.

I then went to see which URLs were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :) So, now I am 99% certain it has to do with the following piece of code:

HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;

// Parse the already-downloaded HTML, then select every <a> that has an href.
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");

All I want to do is extract the links on the page, so for large sites... is there any way I can get this to not use so much CPU?

I was thinking of maybe limiting what I fetch? What would be my best option here?

Surely someone out there has run into this :)
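For reference, one way to read "limit what I fetch" is to cap how many characters are taken from the response before parsing. This is only a hypothetical sketch, not the asker's code: the URL and the 512 KB cap are placeholder values, and `_html` mirrors the field used in the question.

using System.IO;
using System.Net;

// Hypothetical: read at most maxChars of the body, then hand it to the parser.
const int maxChars = 512 * 1024; // assumed cap, tune as needed
var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    var buffer = new char[maxChars];
    int total = 0, read;
    // Network streams return data in chunks, so loop until the cap or EOF.
    while (total < maxChars &&
           (read = reader.Read(buffer, total, maxChars - total)) > 0)
    {
        total += read;
    }
    string _html = new string(buffer, 0, total);
}

Truncating mid-document can cut an element in half, but HtmlAgilityPack is tolerant of malformed input, so the link extraction would still work on the portion that was fetched.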

Recommended Answer

".//a[@href]" is an extremely slow XPath. Try replacing it with "//a[@href]", or with code that simply walks the whole document and checks all A nodes.
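As a minimal sketch of the two alternatives the answer suggests (assuming, as in the question, that `_html` already holds the downloaded page; the LINQ filtering needs `using System.Linq;`):

// Option 1: root-anchored XPath - "//a[@href]" is evaluated once over the
// whole tree instead of ".//a[@href]" relative to a context node.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(_html);
HtmlAgilityPack.HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
// Note: SelectNodes returns null when nothing matches.

// Option 2: skip XPath and walk the node tree once with Descendants.
var hrefs = doc.DocumentNode
    .Descendants("a")
    .Where(a => a.Attributes["href"] != null)
    .Select(a => a.GetAttributeValue("href", string.Empty))
    .ToList();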

Why this XPath is slow:

  1. ".从节点开始
  2. "//"选择所有后代节点
  3. "a"-仅选择"a"个节点
  4. 带有href的"@href".

Portions 1 and 2 together amount to "for every node, select all of its descendant nodes", which is very slow.
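To check the claim on a concrete page, here is a quick micro-benchmark sketch; the file name is a placeholder, and actual timings will vary with document size:

using System;
using System.Diagnostics;
using HtmlAgilityPack;

class XPathTiming
{
    static void Main()
    {
        // Load a large sample page (placeholder path - substitute your own).
        var doc = new HtmlDocument();
        doc.Load("large-page.html");

        // Relative form: descendants are re-expanded per context node.
        var sw = Stopwatch.StartNew();
        var slow = doc.DocumentNode.SelectNodes(".//a[@href]");
        sw.Stop();
        Console.WriteLine("relative .//a[@href]: " + sw.ElapsedMilliseconds + " ms");

        // Rooted form: expanded once from the document root.
        sw.Restart();
        var fast = doc.DocumentNode.SelectNodes("//a[@href]");
        sw.Stop();
        Console.WriteLine("rooted //a[@href]: " + sw.ElapsedMilliseconds + " ms");
    }
}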
