Question
I have built a little crawler, and now when trying it out I found that when crawling certain sites it uses 98-99% CPU.
I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.
I then went to see which URLs were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :) So, now I am 99% certain it has to do with the following piece of code:
HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;

// Parse the raw HTML, then select every anchor element that has an href attribute.
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");
All that I want to do is extract the links on the page, so for large sites... is there any way I can get this to not use so much CPU?
I was thinking of maybe limiting what I fetch? What would be my best option here?
Surely someone has run into this before :)
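(For reference, a minimal sketch of the "limit what I fetch" idea: cap how many bytes are read from the response stream so oversized pages never reach the parser in full. FetchPage and maxBytes are hypothetical names introduced here, not part of the original question, and the UTF-8 decoding is an assumption.)

using System;
using System.IO;
using System.Net;
using System.Text;

class Fetcher
{
    // Hypothetical helper: stop reading once roughly maxBytes
    // have been collected (may overshoot by up to one buffer).
    static string FetchPage(string url, int maxBytes)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            var collected = new MemoryStream();
            var buffer = new byte[8192];
            int read;
            while (collected.Length < maxBytes &&
                   (read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                collected.Write(buffer, 0, read);
            }
            // Assumes UTF-8; a real crawler would honor response.CharacterSet.
            return Encoding.UTF8.GetString(collected.ToArray());
        }
    }
}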
Answer
".//a [@href]"是非常慢的XPath.试图替换为"//a [@href]"或简单地遍历整个文档并检查所有A节点的代码.
".//a[@href]" is extremely slow XPath. Tried to replace with "//a[@href]" or with code that simply walks whole document and checks all A nodes.
Why this XPath is slow:
- ".从节点开始
- "//"选择所有后代节点
- "a"-仅选择"a"个节点
- 带有href的"@href".
Parts 1 + 2 end up as "for every node, select all its descendant nodes", which is very slow.
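(As a sketch of the "walk the whole document" suggestion, HtmlAgilityPack's Descendants enumeration can replace the SelectNodes call; the documentt variable mirrors the question's code, and this is an illustration rather than the answerer's exact code.)

using System.Linq;

// Enumerate <a> elements directly instead of evaluating an XPath,
// keeping only those that actually carry an href attribute.
var links = documentt.DocumentNode
    .Descendants("a")
    .Where(a => a.Attributes["href"] != null)
    .ToList();

Descendants yields the nodes in a single pass over the tree, which avoids the repeated per-node descendant selection described above.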