Problem description
I'm writing a C# console application that scrapes data from web pages.
This application will go to about 8000 web pages and scrape data (the data has the same format on each page).
I have it working right now with no async methods and no multithreading.
However, I need it to be faster. It only uses about 3%-6% of the CPU, I think because it spends most of its time waiting to download the HTML (WebClient.DownloadString(url)).
This is the basic flow of my program:
    DataSet allData;
    foreach (var url in the8000urls)
    {
        // ScrapeData downloads the HTML from the URL with WebClient.DownloadString
        // and scrapes the data into several DataTables, which it returns as a DataSet.
        DataSet dataForOnePage = ScrapeData(url);
        // merge each table in dataForOnePage into allData
    }
    // PushAllDataToSql(allData);
I've been trying to multithread this but am not sure how to get started properly. I'm using .NET 4.5, and my understanding is that async and await in 4.5 are meant to make this much easier to program, but I'm still a little lost.
My idea was to just keep making new threads that are async for this line:
    DataSet dataForOnePage = ScrapeData(url);
and then, as each one finishes, run:
    // merge each table in dataForOnePage into allData
Can anyone point me in the right direction on how to make that line async in .NET 4.5 C# and then have my merge step run on completion?
Thank you.
Edit: Here is my ScrapeData method:
    public static DataSet GetProperyData(CookieAwareWebClient webClient, string pageid)
    {
        var dsPageData = new DataSet();
        // DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
        string url = @"https://domain.com?&id=" + pageid + @"restofurl";
        string html = webClient.DownloadString(url);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData
        return dsPageData;
    }
If you want to use the async and await keywords (you don't have to, but they do make things easier in .NET 4.5), you would first want to change your ScrapeData method to return a Task<T> instance using the async keyword, like so:
    async Task<DataSet> ScrapeDataAsync(Uri url)
    {
        // Create the HttpClientHandler which will handle cookies.
        var handler = new HttpClientHandler();

        // Set cookies on handler.

        // Await an async call to fetch the page here, then convert the
        // result to a DataSet and return it.
        var client = new HttpClient(handler);

        // Wait for the HttpResponseMessage.
        HttpResponseMessage response = await client.GetAsync(url);

        // Get the content, awaiting the string content.
        string content = await response.Content.ReadAsStringAsync();

        // Process the content variable here into a DataSet and return it.
        DataSet ds = ...;

        // Return the DataSet; the async method wraps it in a Task<DataSet>.
        return ds;
    }
Note that you'll probably want to move away from the WebClient class: its original asynchronous operations are event-based rather than built on Task<T> (.NET 4.5 did add Task-returning methods such as DownloadStringTaskAsync, so it can still be awaited). A better choice in .NET 4.5 is the HttpClient class, which I've chosen to use above. Also, take a look at the HttpClientHandler class, specifically the CookieContainer property, which you'll use to send cookies with each request.
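As a rough illustration of the CookieContainer property mentioned above, here is a minimal sketch of wiring a cookie container into an HttpClientHandler. The cookie name, its value, and the class name CookieClientSketch are placeholders, not part of the original code, and reusing one HttpClient for all requests is an assumption based on the usual guidance rather than something the question requires.

```csharp
using System;
using System.Net;
using System.Net.Http;

static class CookieClientSketch
{
    // Builds an HttpClient whose handler sends the cookies held in the
    // given CookieContainer with each request to a matching URL.
    public static HttpClient Create(Uri baseAddress, CookieContainer cookies)
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = cookies,
            UseCookies = true
        };
        return new HttpClient(handler) { BaseAddress = baseAddress };
    }

    static void Main()
    {
        var baseAddress = new Uri("https://domain.com/");

        // Placeholder cookie; a real scraper would add whatever the site expects.
        var cookies = new CookieContainer();
        cookies.Add(baseAddress, new Cookie("session", "placeholder-value"));

        using (HttpClient client = Create(baseAddress, cookies))
        {
            // No request is sent here; this only shows the configuration.
            Console.WriteLine(cookies.GetCookieHeader(baseAddress));
        }
    }
}
```

With a client built this way, ScrapeDataAsync could take the shared HttpClient as a parameter instead of constructing a handler and client on every call.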
However, this means that you will more than likely have to use the await keyword to wait for another async operation, which in this case would be the download of the page. You'll have to tailor your calls that download data to use the asynchronous versions and await those.
Once that is complete, you would normally await the call to ScrapeDataAsync, but you can't do that in this scenario. You are running a loop, so a single awaited variable would be reassigned on each iteration, and awaiting inside the loop would also make the downloads run one after another. In this case, it's better to just store each Task<T> in a list, like so:
    DataSet allData = ...;
    var tasks = new List<Task<DataSet>>();

    foreach (var url in the8000urls)
    {
        // ScrapeDataAsync downloads the HTML from the URL and scrapes the
        // data into several DataTables, which it returns as a DataSet.
        tasks.Add(ScrapeDataAsync(url));
    }
There is still the matter of merging the data into allData. To that end, you want to call the ContinueWith method on the Task<T> instance returned and perform the task of adding the data to allData:
    DataSet allData = ...;
    // ContinueWith with an Action returns a plain Task, hence List<Task>.
    var tasks = new List<Task>();

    foreach (var url in the8000urls)
    {
        // ScrapeDataAsync downloads the HTML from the URL and scrapes the
        // data into several DataTables, which it returns as a DataSet.
        tasks.Add(ScrapeDataAsync(url).ContinueWith(t => {
            // Lock access to the data set, since this is
            // async now.
            lock (allData)
            {
                // Add the data.
            }
        }));
    }
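One detail the continuation has to deal with is a download that fails: reading t.Result on a faulted task rethrows inside the continuation. The following is a minimal, self-contained sketch of the ContinueWith pattern with that check added. FakeScrapeAsync is a hypothetical stand-in for ScrapeDataAsync (no network access, negative ids simulate a failure), and the table and column names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Threading;
using System.Threading.Tasks;

static class ContinuationSketch
{
    // Hypothetical stand-in for ScrapeDataAsync: no network access.
    // Negative ids simulate a failed download.
    public static async Task<DataSet> FakeScrapeAsync(int pageId)
    {
        await Task.Delay(10);
        if (pageId < 0)
        {
            throw new InvalidOperationException("simulated failure for page " + pageId);
        }
        var ds = new DataSet();
        DataTable table = ds.Tables.Add("Pages");
        table.Columns.Add("PageId", typeof(int));
        table.Rows.Add(pageId);
        return ds;
    }

    // Merges every successful page into allData and returns the failure count.
    public static int MergeAll(IEnumerable<int> pageIds, DataSet allData)
    {
        int failed = 0;
        var tasks = new List<Task>();

        foreach (int pageId in pageIds)
        {
            tasks.Add(FakeScrapeAsync(pageId).ContinueWith(t =>
            {
                if (t.IsFaulted)
                {
                    // Reading t.Exception observes the failure; skip the merge.
                    Console.Error.WriteLine(t.Exception.InnerException.Message);
                    Interlocked.Increment(ref failed);
                    return;
                }

                // Continuations may run concurrently, so guard the shared DataSet.
                lock (allData)
                {
                    allData.Merge(t.Result);
                }
            }));
        }

        // Block until every continuation has finished (console-style wait).
        Task.WhenAll(tasks).Wait();
        return failed;
    }

    static void Main()
    {
        var allData = new DataSet();
        int failed = MergeAll(new[] { 1, 2, -1, 3 }, allData);
        Console.WriteLine("failed: " + failed);
        Console.WriteLine("rows: " + allData.Tables["Pages"].Rows.Count);
    }
}
```

Because the continuation handles the fault itself, the continuation tasks complete successfully and waiting on them does not throw; whether a failed page should be logged, retried, or abort the run is a design choice left open here.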
Then, you can wait on all the tasks using the WhenAll method on the Task class and await that:
    // After your loop.
    await Task.WhenAll(tasks);

    // Process allData
However, note that you have a foreach, and WhenAll takes an IEnumerable<T> implementation. This is a good indicator that LINQ is a suitable fit, which it is:
    DataSet allData = ...;
    var tasks =
        from url in the8000urls
        select ScrapeDataAsync(url).ContinueWith(t => {
            // Lock access to the data set, since this is
            // async now.
            lock (allData)
            {
                // Add the data.
            }
        });

    await Task.WhenAll(tasks);

    // Process allData
You can also choose not to use query syntax if you wish; it doesn't matter in this case.
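For comparison, here is a sketch of the same pipeline written with method syntax (Select) instead of a query expression. FakeScrapeAsync is again a hypothetical stand-in for ScrapeDataAsync so that the example runs without network access, and the "Pages" table is invented for the demonstration. The ToList call is an addition: a LINQ query is lazy, so materializing it once makes it explicit where the tasks are started.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Threading.Tasks;

static class MethodSyntaxSketch
{
    // Hypothetical stand-in for ScrapeDataAsync: returns a DataSet with
    // one single-row table after a short delay.
    public static async Task<DataSet> FakeScrapeAsync(Uri url)
    {
        await Task.Delay(10);
        var ds = new DataSet();
        DataTable table = ds.Tables.Add("Pages");
        table.Columns.Add("Url", typeof(string));
        table.Rows.Add(url.ToString());
        return ds;
    }

    public static DataSet ScrapeAll(IEnumerable<Uri> urls)
    {
        var allData = new DataSet();

        // Same shape as the query expression, written with Select.
        // ToList runs the lazy query once, which starts every task.
        List<Task> tasks = urls
            .Select(url => FakeScrapeAsync(url).ContinueWith(t =>
            {
                lock (allData)
                {
                    allData.Merge(t.Result);
                }
            }))
            .ToList();

        Task.WhenAll(tasks).Wait();
        return allData;
    }

    static void Main()
    {
        IEnumerable<Uri> urls = Enumerable.Range(1, 5)
            .Select(i => new Uri("https://domain.com/?id=" + i));

        DataSet allData = ScrapeAll(urls);
        Console.WriteLine(allData.Tables["Pages"].Rows.Count);
    }
}
```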
Note that if the containing method is not marked as async (because you are in a console application and have to wait for the results before the app terminates), then you can simply call the Wait method on the Task returned when you call WhenAll:
    // This will block until all tasks complete. The tasks still run
    // asynchronously; once all of them are done, the code continues
    // to execute.
    Task.WhenAll(tasks).Wait();

    // Process allData.
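A related arrangement, sketched below under the assumption of a plain console application, is to keep all asynchronous work in one async method and block on it exactly once from Main. The sketch uses GetAwaiter().GetResult() rather than Wait(); the difference is that Wait() wraps failures in an AggregateException, while GetResult() rethrows the original exception. FakeDownloadAsync and RunAsync are hypothetical names introduced only for this example.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class ConsoleEntrySketch
{
    // Hypothetical stand-in for a page download: returns the id after a delay.
    static async Task<int> FakeDownloadAsync(int pageId)
    {
        await Task.Delay(10);
        return pageId;
    }

    // All asynchronous work lives in one async method.
    public static async Task<int> RunAsync(int pageCount)
    {
        var tasks = Enumerable.Range(1, pageCount)
            .Select(id => FakeDownloadAsync(id))
            .ToList();

        // WhenAll on Task<int> instances yields the results as an array.
        int[] results = await Task.WhenAll(tasks);
        return results.Sum();
    }

    // The synchronous entry point blocks on that method exactly once.
    static void Main()
    {
        int total = RunAsync(5).GetAwaiter().GetResult();
        Console.WriteLine(total);
    }
}
```

Blocking like this is reasonable in a console application, which has no synchronization context; the same call in a UI or classic ASP.NET context can deadlock.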
The point is that you want to collect your Task instances into a sequence and then wait on the entire sequence before you process allData.
However, I'd suggest trying to process the data before merging it into allData if you can; unless the data processing requires the entire DataSet, you'll get even more performance gains by processing as much of the data as you can when you get it back, as opposed to waiting for all of it to come back.
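One way to process each page as soon as it arrives is a Task.WhenAny loop, sketched below. This is an illustration of the idea, not the only way to do it. FakeScrapeAsync is a hypothetical stand-in for ScrapeDataAsync, and "processing" is reduced to counting rows. Because all processing happens inside a single async method, the steps run one at a time and no lock is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Threading.Tasks;

static class ProcessAsCompletedSketch
{
    // Hypothetical stand-in for ScrapeDataAsync: pages finish at different times.
    public static async Task<DataSet> FakeScrapeAsync(int pageId)
    {
        await Task.Delay(10 * (pageId % 3 + 1));
        var ds = new DataSet();
        DataTable table = ds.Tables.Add("Pages");
        table.Columns.Add("PageId", typeof(int));
        table.Rows.Add(pageId);
        return ds;
    }

    // Starts every download, then handles each result as soon as it is ready.
    public static async Task<int> ProcessAsCompletedAsync(IEnumerable<int> pageIds)
    {
        List<Task<DataSet>> pending = pageIds
            .Select(id => FakeScrapeAsync(id))
            .ToList();

        int rowsProcessed = 0;
        while (pending.Count > 0)
        {
            // WhenAny completes when the first of the pending tasks finishes.
            Task<DataSet> finished = await Task.WhenAny(pending);
            pending.Remove(finished);

            // Awaiting the finished task rethrows if that download failed.
            DataSet page = await finished;

            // Placeholder for real per-page processing.
            rowsProcessed += page.Tables["Pages"].Rows.Count;
        }
        return rowsProcessed;
    }

    static void Main()
    {
        int rows = ProcessAsCompletedAsync(Enumerable.Range(1, 6))
            .GetAwaiter().GetResult();
        Console.WriteLine(rows);
    }
}
```

Each WhenAny call revisits every pending task, so the loop does quadratic bookkeeping overall; for thousands of pages it may be worth measuring this against the ContinueWith approach.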