Using WebBrowser for web scraping in a class library

Problem description

I need to create a method in a class library that gets the content of a URL (which may be dynamically populated by JavaScript).

I am clueless, but after googling for the whole day this is what I came up with (most of the code is from here):

using System;
using System.Threading.Tasks;
using System.Threading;
using System.Windows.Forms;

public static class WebScraper
{
    [STAThread]
    public async static Task<string> LoadDynamicPage(string url, CancellationToken token)
    {
        using (WebBrowser webBrowser = new WebBrowser())
        {
            // Navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler onDocumentComplete = (s, arg) => tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                webBrowser.DocumentCompleted += onDocumentComplete;
                try
                {
                    webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    webBrowser.DocumentCompleted -= onDocumentComplete;
                }
            }

            // get the root element
            var documentElement = webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(500, token);

                // continue polling if the WebBrowser is still busy
                if (webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered
            token.ThrowIfCancellationRequested();
            return html;
        }
    }
}

This currently throws an error.

Am I close? Is there a fix for the above?

Or, if I am off track, is there a ready-made solution for getting dynamic web content using .NET (one that can be called from a class library)?

Recommended answer

Here is what I tested in a web application, and it worked properly.

It uses a WebBrowser control on another thread and returns a Task<string> that completes once the browser content has loaded completely:

using System;
using System.Threading.Tasks;
using System.Threading;
using System.Windows.Forms;
public class BrowserBasedWebScraper
{
    public static Task<string> LoadUrl(string url)
    {
        var tcs = new TaskCompletionSource<string>();
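        // WebBrowser wraps an ActiveX control, so it is created on a dedicated thread
        // whose apartment state is set to STA below.
        Thread thread = new Thread(() => {
            try {
                Func<string> f = () => {
                    using (WebBrowser browser = new WebBrowser())
                    {
                        browser.ScriptErrorsSuppressed = true;
                        browser.Navigate(url);
                        // Pump the message loop until the document has finished loading.
                        while (browser.ReadyState != WebBrowserReadyState.Complete)
                        {
                            System.Windows.Forms.Application.DoEvents();
                        }
                        return browser.DocumentText;
                    }
                };
                tcs.SetResult(f());
            }
            catch (Exception e) {
                // Report any failure to the caller through the returned task.
                tcs.SetException(e);
            }
        });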
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return tcs.Task;
    }
}
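
The thread-plus-TaskCompletionSource pattern in LoadUrl is not specific to WebBrowser. As a minimal sketch (the StaTask class and its Run method are hypothetical names, not part of the original answer), the same idea can be factored into a helper that runs any function on an STA background thread and exposes the result as a task. It assumes Windows, since setting the STA apartment state is not supported on other platforms.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the original answer.
public static class StaTask
{
    // Runs `work` on a dedicated STA background thread and returns a task
    // that completes with its result, or faults with its exception.
    public static Task<T> Run<T>(Func<T> work)
    {
        var tcs = new TaskCompletionSource<T>();
        var thread = new Thread(() =>
        {
            try
            {
                tcs.SetResult(work());
            }
            catch (Exception e)
            {
                tcs.SetException(e);
            }
        });
        // Assumption: running on Windows, where STA apartments are supported.
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return tcs.Task;
    }
}
```

With such a helper, the body of the Func<string> in LoadUrl could be passed as `StaTask.Run(() => { ... })`, and a caller would simply `await` the returned task. Note that neither this sketch nor LoadUrl has a timeout, so a page that never reaches the Complete state would keep the background thread spinning.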


08-22 21:12