Problem description
I need to grab some links that depend on the cookies sent with a GET request. So when I crawl the page with crawler4j, I need to send some cookies along with the request to get the correct page back.
Is this possible? (I searched the web for it but didn't find anything useful.) Or is there a Java crawler out there that is capable of doing this?
Any help would be appreciated.
Recommended answer
It appears that crawler4j might not support cookies: http://www.webuseragents.com/ua/427106/crawler4j-http-code-google-com-p-crawler4j-
There are several alternatives:
- Nutch
- Heritrix
- WebSPHINX
- JSpider
- WebEater
- WebLech
- Arachnid
- JoBo
- Web-Harvest
- Ex-Crawler
- Bixo
I would say that Nutch and Heritrix are the best of these, and I would put special emphasis on Nutch, because it is probably one of the few crawlers designed to scale well and actually perform a large crawl.
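Independent of any particular crawler, the underlying operation the question asks about (attaching cookies to a GET request) can be sketched with the JDK's own HTTP client. This is only a minimal illustration, not how crawler4j or any of the crawlers listed above work internally. It assumes Java 11 or newer; the class name `CookieGetSketch`, the helper names, and the cookie names and values are all made up for the example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: send a GET request with a hand-built Cookie header.
public class CookieGetSketch {

    // Joins name/value pairs into one Cookie header value, e.g. "a=1; b=2".
    public static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    // Builds a GET request that carries the cookies; nothing is sent yet.
    public static HttpRequest buildRequest(String url, Map<String, String> cookies) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", cookieHeader(cookies))
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder cookies; a real site would dictate names and values.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sessionid", "abc123");
        cookies.put("lang", "en");

        if (args.length == 0) {
            // No URL given: only show the header that would be sent.
            System.out.println("Cookie: " + cookieHeader(cookies));
            return;
        }

        // URL given on the command line: actually perform the request.
        HttpRequest request = buildRequest(args[0], cookies);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
        System.out.println("Body length: " + response.body().length());
    }
}
```

The returned body could then be handed to whatever link-extraction step is in use. Whether a given crawler exposes a hook for setting such a header is something to check in that crawler's documentation.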