Question
I'm trying to limit the size of a downloaded page/link with JSoup, given something like the following (Scala code):
val document = Jsoup.connect(theURL).get();
I'd like to fetch only the first few KB of a given page and stop downloading beyond that. If the page is really large (or theURL points to something that isn't HTML, such as a large file), I'd rather not spend time downloading the rest.
My use case is a page-title snarfer for an IRC bot.
Bonus question:
Is there any reason why Jsoup.connect(theURL).timeout(3000).get(); isn't timing out on large files? It ends up causing the bot to ping out if someone pastes something like a never-ending audio stream or a large ISO. That could be solved by fetching URL titles in a different thread (or by using Scala actors and timing out there), but that seems like overkill for a very simple bot when timeout() is supposed to accomplish the same end result.
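For reference, a minimal sketch of the threaded workaround mentioned above, using Scala Futures rather than actors (fetchTitle is a hypothetical helper, not part of the original bot):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.jsoup.Jsoup

// Hypothetical helper: run the fetch on a background thread and give up
// after 3 seconds, so the bot's main loop never blocks on a slow page.
def fetchTitle(theURL: String): Option[String] = {
  val fetch = Future(Jsoup.connect(theURL).get().title())
  try Some(Await.result(fetch, 3.seconds))
  catch { case _: Exception => None } // TimeoutException or network error
}

Note that the abandoned Future still ties up a thread until the download finishes or fails, which is why capping the body size, as in the answer below, is the better fix.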
Answer
As of version 1.7.2 you can limit the maximum body size with the maxBodySize() method: http://jsoup.org/apidocs/org/jsoup/Connection.Request.html#maxBodySize() By default it is limited to 1 MB, which prevents memory leaks.
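A minimal sketch of how that might look in the Scala code from the question (the 8 KB cap and the title extraction are illustrative assumptions, not part of the original answer):

import org.jsoup.Jsoup

// Cap the body at 8 KB (an arbitrary value for this example) and keep
// the 3-second timeout from the question. jsoup stops reading the body
// once the cap is reached; a value of 0 would mean "no limit".
val document = Jsoup.connect(theURL)
  .maxBodySize(8 * 1024)
  .timeout(3000)
  .get()
val title = document.title()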