HtmlUnit WebClient Timeout

Problem description

In my previous questions about HtmlUnit, "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck", I mentioned that fetching a URL gets stuck. I also found that it gets stuck because one of the methods in the HtmlUnit library (parse) never returns.

I did further work on this and wrote code to abandon the method if it takes longer than a specified number of timeout seconds to complete.

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HandleHtmlUnitTimeout {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException, TimeoutException {
        Date start = new Date();
        String url = "http://ericaweiner.com/collections/";
        doWorkWithTimeout(url, 60);
    }

    public static void doWorkWithTimeout(final String url, long timeoutSecs) throws InterruptedException, TimeoutException {
        //maintains a thread for executing the getPageSource method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        //logger.info("Starting method with "+timeoutSecs+" seconds as timeout");

        //set the executor thread working
        final Future<?> future = executor.submit(new Runnable() {
            public void run() {
                try {
                    getPageSource(url);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });

        //check the outcome of the executor thread and limit the time allowed for it to complete
        try {
            future.get(timeoutSecs, TimeUnit.SECONDS);
        } catch (Exception e) {
            //ExecutionException: deliverer threw exception
            //TimeoutException: didn't complete within downloadTimeoutSecs
            //InterruptedException: the executor thread was interrupted

            //interrupts the worker thread if necessary
            future.cancel(true);

            //logger.warn("encountered problem while doing some work", e);
            throw new TimeoutException();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void getPageSource(String productPageUrl) {
        try {
            if (productPageUrl == null) {
                productPageUrl = "http://ericaweiner.com/collections/";
            }

            WebClient wb = new WebClient(BrowserVersion.FIREFOX_3_6);
            wb.getOptions().setTimeout(120000);
            wb.getOptions().setJavaScriptEnabled(true);
            wb.getOptions().setThrowExceptionOnScriptError(true);
            wb.getOptions().setThrowExceptionOnFailingStatusCode(false);
            HtmlPage page = wb.getPage(productPageUrl);
            wb.waitForBackgroundJavaScript(4000);
            wb.closeAllWindows();
        } catch (FailingHttpStatusCodeException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

This code does return from the doWorkWithTimeout(url, 60) call, but the program does not terminate.
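One plausible reading of this symptom, not confirmed anywhere in the question, is that the pool thread created by Executors.newFixedThreadPool(1) is a non-daemon thread that is still stuck inside the hanging call, so the JVM cannot exit even after the caller has moved on. The sketch below illustrates that idea with plain java.util.concurrent only. The class name DaemonTimeoutSketch and the helpers runWithTimeout and stuckTask are hypothetical, and stuckTask merely imitates a call that ignores interruption; it is not HtmlUnit code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DaemonTimeoutSketch {

    /** Builds a single-thread executor whose worker is a daemon thread (hypothetical helper). */
    static ExecutorService newDaemonExecutor() {
        return Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "timeout-worker");
                // A daemon thread does not keep the JVM alive once all non-daemon threads have ended.
                t.setDaemon(true);
                return t;
            }
        });
    }

    /** Returns true if the task finished within the timeout, false if it timed out or the caller was interrupted. */
    public static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        ExecutorService executor = newDaemonExecutor();
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Requests interruption; a task that ignores interrupts keeps running anyway.
            future.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    /** Stand-in for a call that never returns and swallows interruption (assumption, not HtmlUnit). */
    public static Runnable stuckTask() {
        return new Runnable() {
            public void run() {
                while (true) {
                    try {
                        Thread.sleep(50);
                    } catch (InterruptedException ignored) {
                        // Deliberately ignored to imitate uninterruptible work.
                    }
                }
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("stuck task finished in time: " + runWithTimeout(stuckTask(), 500));
        // main returns here; because the worker is a daemon thread, the JVM can exit.
    }
}
```

Marking the worker as a daemon only stops it from blocking JVM exit. The stuck thread and anything it holds stay alive until the process ends, so this is a workaround for a short-lived program rather than a real cancellation.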

When I try a similar implementation with the following code:

import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.log4j.Logger;


public class HandleScraperTimeOut {

    private static Logger logger = Logger.getLogger(HandleScraperTimeOut.class);

    public void doWork() throws InterruptedException {
        logger.info(new Date() + "Starting worker method ");
        Thread.sleep(20000);
        logger.info(new Date() + "Ending worker method ");
        //perform some long running task here...
    }

    public void doWorkWithTimeout(int timeoutSecs) {
        //maintains a thread for executing the doWork method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        logger.info("Starting method with " + timeoutSecs + " seconds as timeout");

        //set the executor thread working
        final Future<?> future = executor.submit(new Runnable() {
            public void run() {
                try {
                    doWork();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });

        //check the outcome of the executor thread and limit the time allowed for it to complete
        try {
            future.get(timeoutSecs, TimeUnit.SECONDS);
        } catch (Exception e) {
            //ExecutionException: deliverer threw exception
            //TimeoutException: didn't complete within downloadTimeoutSecs
            //InterruptedException: the executor thread was interrupted

            //interrupts the worker thread if necessary
            future.cancel(true);

            logger.warn("encountered problem while doing some work", e);
        }
        executor.shutdown();
    }

    public static void main(String a[]) {
        HandleScraperTimeOut hcto = new HandleScraperTimeOut();
        hcto.doWorkWithTimeout(30);
    }

}

If anybody can have a look and tell me what the issue is, it would be really helpful.

For more details about the issue, see "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck".

Update 1: The strange thing is that future.cancel(true) returns true in both cases. This is what I expected:

With HtmlUnit, it should return false, since the process is still hanging.
With a plain Thread.sleep(), it should return true, since the process was cancelled successfully.
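The return value of Future.cancel(true) reports whether the cancellation request was accepted, not whether the worker thread actually stopped. A minimal sketch of that behavior follows, using a task that never checks its interrupt flag as a stand-in for the hanging HtmlUnit call. The class name CancelReturnValueDemo and the method cancelUninterruptibleTask are hypothetical names chosen for this illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelReturnValueDemo {

    /**
     * Cancels a running task that ignores interruption and returns
     * { value returned by cancel(true), whether the task was still running afterwards }.
     */
    public static boolean[] cancelUninterruptibleTask() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        final CountDownLatch started = new CountDownLatch(1);
        final AtomicBoolean release = new AtomicBoolean(false);
        final AtomicBoolean finished = new AtomicBoolean(false);

        Future<?> future = executor.submit(new Runnable() {
            public void run() {
                started.countDown();
                // Never checks the interrupt flag, so cancel(true) cannot stop it.
                while (!release.get()) {
                    Thread.yield();
                }
                finished.set(true);
            }
        });

        try {
            started.await();                        // make sure the task is running
            boolean cancelled = future.cancel(true);
            Thread.sleep(200);                      // give the interrupt time to take effect, if it could
            boolean stillRunning = !finished.get();

            release.set(true);                      // let the worker end so the demo can exit
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
            return new boolean[] { cancelled, stillRunning };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = cancelUninterruptibleTask();
        System.out.println("cancel(true) returned: " + result[0]);
        System.out.println("task still running after cancel: " + result[1]);
    }
}
```

Both values come out true here: the Future is marked cancelled while the thread keeps running. cancel returns false mainly when the task has already completed or was already cancelled, so a true result in the HtmlUnit case does not show that the hung call was stopped.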

Update 2: It only hangs with the http://ericaweiner.com/collections/ URL. If I give any other URL, e.g. http://www.google.com or http://www.yahoo.com, it does not hang. In these cases it throws InterruptedException and comes out of the process.

It seems that the page source of http://ericaweiner.com/collections/ contains certain elements that cause the problem.

Recommended answer