Problem description
I need to create a method in a class library to get the content of a URL (which may be dynamically populated by JavaScript).
I am clueless, but after googling for the whole day this is what I came up with (most of the code is taken from elsewhere):
using System;
using System.Threading.Tasks;
using System.Threading;
using System.Windows.Forms;

public static class WebScraper
{
    [STAThread]
    public async static Task<string> LoadDynamicPage(string url, CancellationToken token)
    {
        using (WebBrowser webBrowser = new WebBrowser())
        {
            // Navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler onDocumentComplete = (s, arg) => tcs.TrySetResult(true);
            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                webBrowser.DocumentCompleted += onDocumentComplete;
                try
                {
                    webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    webBrowser.DocumentCompleted -= onDocumentComplete;
                }
            }

            // get the root element
            var documentElement = webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(500, token);

                // continue polling if the WebBrowser is still busy
                if (webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered
            token.ThrowIfCancellationRequested();
            return html;
        }
    }
}
It currently throws an error.
Am I close? Is there a fix for the above?
Or, if I am off track, is there a ready-made solution for getting dynamic web content using .NET (one that can be called from a class library)?
Recommended answer
Here is what I tested in a web application, where it worked properly.
It runs a WebBrowser control on a separate thread and returns a Task<string> that completes once the browser content has loaded completely:
using System;
using System.Threading.Tasks;
using System.Threading;
using System.Windows.Forms;

public class BrowserBasedWebScraper
{
    public static Task<string> LoadUrl(string url)
    {
        var tcs = new TaskCompletionSource<string>();
        Thread thread = new Thread(() => {
            try {
                Func<string> f = () => {
                    using (WebBrowser browser = new WebBrowser())
                    {
                        browser.ScriptErrorsSuppressed = true;
                        browser.Navigate(url);
                        // Pump the message loop until the document has finished loading
                        while (browser.ReadyState != WebBrowserReadyState.Complete)
                        {
                            System.Windows.Forms.Application.DoEvents();
                        }
                        return browser.DocumentText;
                    }
                };
                tcs.SetResult(f());
            }
            catch (Exception e) {
                // Surface any failure to the caller through the returned task
                tcs.SetException(e);
            }
        });
        // WebBrowser is an ActiveX wrapper and must be created on an STA thread
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return tcs.Task;
    }
}
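The thread-handling part of the answer can be separated from the browser logic. The following is only a minimal sketch of that pattern, not part of the original answer: StaTask, Run and Demo are hypothetical names introduced here for illustration. It assumes a Windows environment, since Thread.SetApartmentState is not supported on other platforms.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not from the original answer): runs a delegate on a
// dedicated STA background thread and exposes its outcome as a Task<T>.
public static class StaTask
{
    public static Task<T> Run<T>(Func<T> work)
    {
        var tcs = new TaskCompletionSource<T>();
        var thread = new Thread(() =>
        {
            try
            {
                tcs.SetResult(work());
            }
            catch (Exception e)
            {
                // Report the failure through the task instead of crashing the thread
                tcs.SetException(e);
            }
        });
        // Assumption: Windows only; this call throws on other platforms
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return tcs.Task;
    }
}

// Hypothetical usage: report which apartment the worker thread runs in.
public static class Demo
{
    public static ApartmentState ShowApartmentState()
    {
        ApartmentState state =
            StaTask.Run(() => Thread.CurrentThread.GetApartmentState()).Result;
        Console.WriteLine(state);
        return state;
    }
}
```

With such a helper, the body of LoadUrl could in principle be passed in as the delegate, which keeps the WebBrowser creation, navigation and disposal on the same STA thread. From async code the returned task would normally be awaited rather than blocked on with .Result.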