Question
I'm trying to implement a limited web crawler in C# (for a few hundred sites only) using HttpWebResponse.GetResponse() and StreamReader.ReadToEnd(); I have also tried using StreamReader.Read() and a loop to build my HTML string.
I'm only downloading pages that are about 5-10 KB.
It's all very slow! For example, the average GetResponse() time is about half a second, while the average StreamReader.ReadToEnd() time is about 5 seconds!
All the sites should be very fast, as they are very close to my location and have fast servers. (In Explorer they take practically no time to download.) And I am not using any proxy.
My crawler has about 20 threads reading simultaneously from the same site. Could this be causing a problem?
How do I reduce the StreamReader.ReadToEnd() times DRASTICALLY?
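For reference, a minimal sketch of the fetch pattern described above (url stands for one of the crawled pages; error handling and the threading code are omitted):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())      // averages ~0.5 s
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();                                           // averages ~5 s
}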
Answer
HttpWebRequest may be taking a while to detect your proxy settings. Try adding this to your application config:
<system.net>
  <defaultProxy enabled="false">
    <proxy/>
    <bypasslist/>
    <module/>
  </defaultProxy>
</system.net>
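If you would rather not edit the config file, disabling the proxy per request should have a similar effect; a small sketch, assuming each HttpWebRequest is created explicitly (this is not part of the original answer):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Proxy = null;   // skip automatic proxy detection for this request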
You might also see a slight performance gain from buffering your reads to reduce the number of calls made to the underlying operating system socket:
// stream is the response stream obtained from HttpWebResponse.GetResponseStream()
using (BufferedStream buffer = new BufferedStream(stream))
{
    using (StreamReader reader = new StreamReader(buffer))
    {
        pageContent = reader.ReadToEnd();
    }
}
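Putting the two suggestions together, a self-contained sketch of a single page fetch might look like the following; the class and method names are placeholders, not part of the original answer:

using System.IO;
using System.Net;

class CrawlerFetchSketch
{
    public static string DownloadPage(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = null;   // avoid the proxy auto-detection delay

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (BufferedStream buffer = new BufferedStream(stream))
        using (StreamReader reader = new StreamReader(buffer))
        {
            // The buffered stream reduces the number of small reads against the socket.
            return reader.ReadToEnd();
        }
    }
}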