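I am trying to parse data from tempobet.com in English format. The problem is that when I use the Google REST client, it returns the HTML I want, but when I fetch the same page through Jsoup, the dates come back in the format of my locale. Here is the test code: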
import java.io.IOException;
import java.util.Date;
import java.util.ListIterator;
import java.util.Locale;
import org.apache.commons.lang3.time.DateUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
public class ParseHtmlTest {

    @Test
    public void testName() throws IOException {
        // First request: only used to obtain the session cookies
        Response response = Jsoup.connect("https://www.tempobet.com/league191_5_0.html")
                .userAgent("Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36")
                .execute();

        // Second request: ask for English and send the cookies back
        Document doc = Jsoup.connect("https://www.tempobet.com/league191_5_0.html")
                .userAgent("Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36")
                .header("Accept-Language", "en-US")
                .header("Accept-Encoding", "gzip,deflate,sdch")
                .cookies(response.cookies())
                .get();

        // Print every cell of every row after the table's header rows
        Elements tableElement = doc.select("table[class=table-a]");
        ListIterator<Element> trElementIterator = tableElement.select("tr:gt(2)").listIterator();
        while (trElementIterator.hasNext()) {
            ListIterator<Element> tdElementIterator = trElementIterator.next().select("td").listIterator();
            while (tdElementIterator.hasNext()) {
                System.out.println(tdElementIterator.next());
            }
        }
    }
}
Here is a sample of the response:
<td width="40" class="grey">21 Nis 20:00</td>
The date should be "21 Apr 20:00". I would appreciate any help. Thanks anyway.

Best answer
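If tempobet only looked at the Accept-Language header, it could be that easy...

They serve tr (tempobet22.com) and en (tempobet.com) on separate domains. The first call to the en domain redirects to the tr domain. If you choose another language, they do two redirects and some magic session sharing. For the first redirect you need the GAMBLINGSESS cookie from the first domain, and for the second redirect you need the one from the second domain. Jsoup does not know about this when it follows the redirects itself...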
String userAgent = "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36";
// get a session for tr and en domain
String tempobetSession = Jsoup.connect("https://www.tempobet.com/").userAgent(userAgent).execute().cookie("GAMBLINGSESS");
String tempobet22Session = Jsoup.connect("https://www.tempobet22.com/").userAgent(userAgent).execute().cookie("GAMBLINGSESS");
// tell the tr domain that we want to go to en, without following the redirect
String redirect = Jsoup.connect("https://www.tempobet22.com/?change_lang=https://www.tempobet.com/")
.userAgent(userAgent).cookie("GAMBLINGSESS", tempobet22Session).followRedirects(false).execute().header("Location");
// The redirect goes to the en domain and includes our hashed tr cookie as a parameter, but this request needs an en cookie
Response response = Jsoup.connect(redirect).userAgent(userAgent).cookie("GAMBLINGSESS", tempobetSession).execute();
// finally...
Document doc = Jsoup.connect("https://www.tempobet.com/league191_5_0.html").userAgent(userAgent).cookies(response.cookies()).get();
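The bookkeeping above (one GAMBLINGSESS value per domain, each sent only to its own domain) can be made a little less error-prone with a small per-host cookie store. The following is only a rough sketch using plain JDK classes; `HostCookieJar` and its methods are hypothetical names introduced here, not part of Jsoup, and it matches on the exact host name only, which is simpler than the domain matching real browsers do.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Jsoup): remembers cookies separately for each host. */
class HostCookieJar {

    // host name -> (cookie name -> cookie value)
    private final Map<String, Map<String, String>> cookiesByHost = new HashMap<>();

    /** Stores the cookies received from the given URL under that URL's host. */
    void store(String url, Map<String, String> cookies) {
        cookiesByHost.computeIfAbsent(hostOf(url), h -> new HashMap<>()).putAll(cookies);
    }

    /** Returns the cookies known for the host of the given URL (empty if none). */
    Map<String, String> cookiesFor(String url) {
        Map<String, String> known = cookiesByHost.get(hostOf(url));
        return known == null ? Collections.emptyMap() : Collections.unmodifiableMap(known);
    }

    // Simplification: exact host match only, no parent-domain or path matching.
    private static String hostOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return host.toLowerCase();
    }
}
```

With Jsoup this would be used roughly as `jar.store(url, response.cookies())` after each `execute()` and `.cookies(jar.cookiesFor(nextUrl))` before the next request, keeping `followRedirects(false)` so that each hop is sent with the cookies of the domain it targets.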