Problem description
I'm using Apache HttpComponents to GET web pages for a set of crawled URLs. Many of those URLs redirect to different URLs (e.g. because they have been processed with a URL shortener). In addition to downloading the content, I would like to resolve the final URL (i.e. the URL that actually served the downloaded content) or, even better, all URLs in the redirect chain.
I have been looking through the API docs but could not find where to hook in. Any hints would be greatly appreciated.
Recommended answer
Here's a full demo of how to do it using Apache HttpComponents.
You need to extend DefaultRedirectStrategy like this:
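import java.net.URI;
import java.util.Deque;
import java.util.LinkedList;
import org.apache.http.HttpRequest;
import org.apache.http.HttpResponse;
import org.apache.http.ProtocolException;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.protocol.HttpContext;

class SpyStrategy extends DefaultRedirectStrategy {
    // push() adds at the head, so the most recently visited URI comes first.
    public final Deque<URI> history = new LinkedList<>();

    public SpyStrategy(URI uri) {
        // Record the originally requested URI.
        history.push(uri);
    }

    @Override
    public HttpUriRequest getRedirect(
            HttpRequest request,
            HttpResponse response,
            HttpContext context) throws ProtocolException {
        // Let the default strategy build the redirect, then record its target.
        HttpUriRequest redirect = super.getRedirect(request, response, context);
        history.push(redirect.getURI());
        return redirect;
    }
}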
The expand method sends a HEAD request, which causes client to collect URIs in the spy.history deque as it follows redirects automatically:
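import java.io.IOException;
import java.net.URI;
import java.util.Deque;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.impl.client.DefaultHttpClient;

public static Deque<URI> expand(String uri) {
    HttpHead head = new HttpHead(uri);
    SpyStrategy spy = new SpyStrategy(head.getURI());
    DefaultHttpClient client = new DefaultHttpClient();
    client.setRedirectStrategy(spy);
    try {
        // FIXME: the response status is never checked, so HTTP errors are silently ignored.
        client.execute(head);
        return spy.history;
    } catch (IOException e) {
        throw new RuntimeException(e);
    } finally {
        // Release the connections held by this one-off client.
        client.getConnectionManager().shutdown();
    }
}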
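Because SpyStrategy records each URI with push, the deque returned by expand holds the chain newest-first: the head is the final URL and the tail is the URL originally requested. The following is a minimal sketch of reading it back, using only the JDK. The class name RedirectHistory, its two helper methods, and the example URLs are made up for illustration and are not part of HttpComponents; main simulates the recorded history instead of making real requests.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper for reading a push()-built redirect history.
final class RedirectHistory {

    // The head of the deque is the last URI pushed, i.e. the final target (null if empty).
    static URI finalUri(Deque<URI> history) {
        return history.peekFirst();
    }

    // Walk the deque tail-to-head to get the URIs in the order they were requested.
    static List<URI> inRequestOrder(Deque<URI> history) {
        List<URI> ordered = new ArrayList<>();
        Iterator<URI> it = history.descendingIterator();
        while (it.hasNext()) {
            ordered.add(it.next());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Simulate what SpyStrategy records: the original URI first, then each redirect target.
        Deque<URI> history = new LinkedList<>();
        history.push(URI.create("http://short.example/abc"));
        history.push(URI.create("http://www.example.com/landing"));
        history.push(URI.create("https://www.example.com/landing"));

        System.out.println(finalUri(history));
        System.out.println(inRequestOrder(history));
    }
}
```

With a real client, the same two helpers could be applied to the result of expand(...) directly.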
You may want to limit the maximum number of redirects followed to something reasonable (instead of the default of 100), like so:
BasicHttpParams params = new BasicHttpParams();
params.setIntParameter(ClientPNames.MAX_REDIRECTS, 5);
DefaultHttpClient client = new DefaultHttpClient(params);
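To make the effect of such a cap concrete, here is a library-independent sketch of following a redirect chain with a limit. It is not how HttpComponents implements MAX_REDIRECTS. The class RedirectCap, the follow method, and the locationOf lookup (a stand-in for "send a request and return the Location header, or null if there is none") are all assumptions made for illustration, as is the choice of IllegalStateException once the limit is exceeded.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, JDK-only illustration of a redirect limit.
final class RedirectCap {

    // Follows Location values from start, allowing at most maxRedirects hops.
    // locationOf returns the redirect target for a URI, or null when there is no redirect.
    static List<URI> follow(URI start, int maxRedirects, Function<URI, String> locationOf) {
        List<URI> chain = new ArrayList<>();
        chain.add(start);
        URI current = start;
        String location;
        while ((location = locationOf.apply(current)) != null) {
            // chain.size() - 1 redirects have been followed so far.
            if (chain.size() > maxRedirects) {
                throw new IllegalStateException("Maximum redirects (" + maxRedirects + ") exceeded");
            }
            // Location may be relative, so resolve it against the current URI.
            current = current.resolve(location);
            chain.add(current);
        }
        return chain;
    }

    public static void main(String[] args) {
        // A fake redirect table standing in for real HTTP responses.
        Map<URI, String> table = new HashMap<>();
        table.put(URI.create("http://short.example/abc"), "http://www.example.com/landing");
        table.put(URI.create("http://www.example.com/landing"), "/final");

        System.out.println(follow(URI.create("http://short.example/abc"), 5, table::get));
    }
}
```

In this sketch a limit of 5 lets the two-hop chain complete, while a limit of 1 would fail on the second hop.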