Handling an HtmlUnit WebClient timeout

Problem description

In my previous questions about HtmlUnit, "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck",

I mentioned that the URL gets stuck. I also found that it gets stuck because one of the methods (parse) in the HtmlUnit library never returns.

I did further work on this. I wrote code to get out of the method if it takes more than the specified number of timeout seconds to complete.

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HandleHtmlUnitTimeout {

public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException, TimeoutException
    {
        Date start = new Date();
        String url = "http://ericaweiner.com/collections/";
        doWorkWithTimeout(url, 60);
    }

public static void doWorkWithTimeout(final String url, long timeoutSecs) throws InterruptedException, TimeoutException {
    //maintains a thread for executing the doWork method
    ExecutorService executor = Executors.newFixedThreadPool(1);
    //logger.info("Starting method with "+timeoutSecs+" seconds as timeout");
    //set the executor thread working

    final Future<?> future = executor.submit(new Runnable() {
        public void run()
            {
            try
                {
                getPageSource(url);
                }
            catch (Exception e)
                {
                throw new RuntimeException(e);
                }
        }
    });

    //check the outcome of the executor thread and limit the time allowed for it to complete
    try {
        future.get(timeoutSecs, TimeUnit.SECONDS);
    } catch (Exception e) {
        //ExecutionException: deliverer threw exception
        //TimeoutException: didn't complete within downloadTimeoutSecs
        //InterruptedException: the executor thread was interrupted

        //interrupts the worker thread if necessary
        future.cancel(true);

        //logger.warn("encountered problem while doing some work", e);
        throw new TimeoutException();
    }finally{
    executor.shutdownNow();
    }
}

public static void getPageSource(String productPageUrl)
    {
    try
        {
        if (productPageUrl == null)
            {
            productPageUrl = "http://ericaweiner.com/collections/";
            }

        WebClient wb = new WebClient(BrowserVersion.FIREFOX_3_6);
        wb.getOptions().setTimeout(120000);
        wb.getOptions().setJavaScriptEnabled(true);
        wb.getOptions().setThrowExceptionOnScriptError(true);
        wb.getOptions().setThrowExceptionOnFailingStatusCode(false);
        HtmlPage page = wb.getPage(productPageUrl);
        wb.waitForBackgroundJavaScript(4000);
        wb.closeAllWindows();
        }
    catch (FailingHttpStatusCodeException e)
        {
        e.printStackTrace();
        }
    catch (MalformedURLException e)
        {
        e.printStackTrace();
        }
    catch (IOException e)
        {
        e.printStackTrace();
        }
    }

}

This code does come out of the doWorkWithTimeout(url, 60); method, but the program does not terminate.

For comparison, here is a similar implementation that uses Thread.sleep() instead of HtmlUnit:

import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.log4j.Logger;


public class HandleScraperTimeOut {

private static Logger logger = Logger.getLogger(HandleScraperTimeOut .class);


public void doWork() throws InterruptedException {
    logger.info(new Date()+ "Starting worker method ");
    Thread.sleep(20000);
    logger.info(new Date()+ "Ending worker method ");
    //perform some long running task here...
}

public void doWorkWithTimeout(int timeoutSecs) {
    //maintains a thread for executing the doWork method
    ExecutorService executor = Executors.newFixedThreadPool(1);
    logger.info("Starting method with "+timeoutSecs+" seconds as timeout");
    //set the executor thread working

    final Future<?> future = executor.submit(new Runnable() {
        public void run()
            {
            try
                {
                doWork();
                }
            catch (Exception e)
                {
                throw new RuntimeException(e);
                }
        }
    });

    //check the outcome of the executor thread and limit the time allowed for it to complete
    try {
        future.get(timeoutSecs, TimeUnit.SECONDS);
    } catch (Exception e) {
        //ExecutionException: deliverer threw exception
        //TimeoutException: didn't complete within downloadTimeoutSecs
        //InterruptedException: the executor thread was interrupted

        //interrupts the worker thread if necessary
        future.cancel(true);

        logger.warn("encountered problem while doing some work", e);
    }
    executor.shutdown();
}

public static void main(String a[])
    {
        HandleScraperTimeOut hcto = new HandleScraperTimeOut ();
        hcto.doWorkWithTimeout(30);

    }

}

If anybody can have a look and tell me what the issue is, it would be really helpful.

For more details about the issue, see "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck".

Update 1: Strangely, future.cancel(true); returns TRUE in both cases. What I expected was:

  • With HtmlUnit, it should return FALSE since the process is still hanging.
  • With a normal Thread.sleep(); it should return TRUE since the process got cancelled successfully.

Update 2: It only hangs with the http://ericaweiner.com/collections/ URL. If I give any other URL, e.g. http://www.google.com or http://www.yahoo.com, it does not hang. In these cases it throws InterruptedException and comes out of the process.

It seems that the http://ericaweiner.com/collections/ page source contains certain elements that cause the problem.

Recommended answer

Future.cancel(boolean) returns:

  • false if the task could not be cancelled, typically because it has already completed normally
  • true otherwise
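The two return values can be observed with a small sketch that uses only the JDK. The class name CancelReturnValues and the sleeping task are illustrative assumptions, not part of the original question.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CancelReturnValues {

    /** Returns {cancel() on a completed task, cancel() on a task that has not completed}. */
    public static boolean[] demo() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> finished = executor.submit(new Runnable() {
                public void run() { /* completes immediately */ }
            });
            finished.get(); // wait until the task has completed normally

            Future<?> running = executor.submit(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(60000); // long-running, but responsive to interrupts
                    } catch (InterruptedException e) {
                        // interrupted by cancel(true): simply return
                    }
                }
            });
            return new boolean[] { finished.cancel(true), running.cancel(true) };
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = demo();
        // prints "completed task: false, unfinished task: true"
        System.out.println("completed task: " + r[0] + ", unfinished task: " + r[1]);
    }
}
```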

Cancelled means that the thread did not finish before cancel was called, the cancelled flag was set to true and, if requested, the thread was interrupted.

Interrupting the thread means that Thread.interrupt was called and nothing more. Future.cancel(boolean) does not check whether the thread actually stopped.
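A minimal sketch of this behaviour, using a busy loop as a stand-in for code that never checks the interrupt flag. The loop is an assumption for illustration only; the source does not establish what the hanging HtmlUnit call does internally. The class name CancelDoesNotStop is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelDoesNotStop {

    /** Returns {value returned by cancel(true), whether the task was still running afterwards}. */
    public static boolean[] demo() throws Exception {
        final AtomicBoolean keepRunning = new AtomicBoolean(true);
        final CountDownLatch started = new CountDownLatch(1);
        final CountDownLatch finished = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Future<?> future = executor.submit(new Runnable() {
            public void run() {
                started.countDown();
                // Stand-in for code that never looks at the interrupt flag.
                while (keepRunning.get()) {
                    Thread.yield();
                }
                finished.countDown();
            }
        });

        started.await();                         // make sure the task is really running
        boolean cancelled = future.cancel(true); // sets the cancelled state and interrupts the thread
        // Give the interrupt half a second to have an effect, if it were going to have one.
        boolean stopped = finished.await(500, TimeUnit.MILLISECONDS);

        keepRunning.set(false);                  // the only thing that actually ends this task
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return new boolean[] { cancelled, !stopped };
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = demo();
        System.out.println("cancel returned: " + r[0] + ", still running afterwards: " + r[1]);
    }
}
```

Here cancel(true) reports success even though the task keeps running until its own flag is cleared.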

So it is correct that cancel returns true in both cases.
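This may also explain why the first program does not terminate. Threads created by the default Executors thread factory are non-daemon, and the JVM does not exit while a non-daemon thread is still running, so a worker that ignores the interrupt keeps the process alive. One possible workaround, sketched below under that assumption, is to run the worker on a daemon thread. The names DaemonWorkerTimeout, DAEMON_FACTORY and runWithTimeout are hypothetical, and the endless loop merely imitates the hanging call.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DaemonWorkerTimeout {

    /** Thread factory whose threads do not keep the JVM alive. */
    static final ThreadFactory DAEMON_FACTORY = new ThreadFactory() {
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, "timeout-worker");
            t.setDaemon(true);
            return t;
        }
    };

    /** Runs the task with a time limit; returns true if it finished in time. */
    public static boolean runWithTimeout(Runnable task, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(DAEMON_FACTORY);
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // only requests an interrupt; the task may keep running
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable stuck = new Runnable() {
            public void run() {
                while (true) { } // ignores interrupts, imitating the hanging call
            }
        };
        // prints "finished in time: false"
        System.out.println("finished in time: " + runWithTimeout(stuck, 1000));
        // main returns here; the JVM can exit although the daemon worker is still spinning
    }
}
```

A daemon thread does not stop the stuck work. It only lets the JVM exit once the other threads are done, so the worker still consumes CPU and memory until then.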

Interrupting a thread means it should stop as soon as possible, but this is not enforced. You can try to make it stop or fail by closing a resource it needs. I usually do that with a thread that is reading (waiting for incoming data) from a socket: I close the socket so it stops waiting.
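The socket technique can be sketched with a loopback connection as follows. The class name CloseToUnblock and the local server are assumptions made to keep the example self-contained. Whether the hanging HtmlUnit call holds a resource that can be closed in a comparable way is not established here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseToUnblock {

    /** Returns true if the blocked reader ended with an IOException once its socket was closed. */
    public static boolean readerStopsWhenSocketClosed() throws Exception {
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        final Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
        Socket peer = server.accept(); // connected, but never sends any data
        final AtomicBoolean sawIOException = new AtomicBoolean(false);

        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    client.getInputStream().read(); // blocks waiting for incoming data
                } catch (IOException e) {
                    sawIOException.set(true);       // the read failed because the socket was closed
                }
            }
        });
        reader.start();
        Thread.sleep(300);  // give the reader time to block in read()

        client.close();     // closing the socket makes the blocked read() throw
        reader.join(5000);
        boolean stopped = !reader.isAlive();

        peer.close();
        server.close();
        return stopped && sawIOException.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reader stopped: " + readerStopsWhenSocketClosed());
    }
}
```

The reader thread is never interrupted in this sketch. It ends only because the resource it was waiting on has gone away.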
