Problem description
If you visit http://www.imdb.com/title/tt2375692/episodes?season=1, you will see that the air date of season 1, episode 1 is 25 Jan. 2014.
This is the code I am using to scrape the page:
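import urllib2
from bs4 import BeautifulSoup

# The lines below run inside a method of the scraper class (hence "self").
req = urllib2.Request('http://www.imdb.com/title/tt2375692/episodes?season=1')
self.diziPage = urllib2.urlopen(req).read()
self.diziSoup = BeautifulSoup(self.diziPage, from_encoding="utf8")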
After scraping the page, everything is correct except the air date: episode 1's air date comes out as 20 April 2014, which does not appear on the page when I visit it in a browser. All of the other information is correct.
I thought the request headers might be the cause, so I experimented with them, but that didn't help.
Recommended answer
It seems that IMDb serves different air dates depending on the visitor's location, which is why I was getting different data. I think they check the visitor's IP address or something similar.
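One way to check this guess is to fetch the same URL once directly and once through a proxy in another country, then compare the air dates. The following is only a rough sketch under several assumptions: it uses Python 3's urllib.request (the successor to urllib2) and the standard-library HTML parser instead of BeautifulSoup, it assumes the air dates sit in <div class="airdate"> elements (this may not match IMDb's current markup), and the helper names and any proxy address you pass in are hypothetical.

```python
import urllib.request
from html.parser import HTMLParser

URL = "http://www.imdb.com/title/tt2375692/episodes?season=1"


class AirdateParser(HTMLParser):
    """Collect the text of every <div class="airdate"> (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.airdates = []
        self._in_airdate = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "airdate" in classes:
            self._in_airdate = True
            self.airdates.append("")

    def handle_endtag(self, tag):
        # Any closing </div> ends the current air date block.
        if tag == "div":
            self._in_airdate = False

    def handle_data(self, data):
        if self._in_airdate:
            self.airdates[-1] += data


def extract_airdates(html):
    """Return the air date strings found in html, whitespace-normalized."""
    parser = AirdateParser()
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.airdates]


def make_opener(proxy_url=None):
    """Build an opener that uses proxy_url if given, else the default setup."""
    if proxy_url is None:
        return urllib.request.build_opener()
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, proxy_url=None, timeout=30):
    """Download url (optionally via a proxy) and return it as text."""
    with make_opener(proxy_url).open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def compare_airdates(url, proxy_url):
    """Return (direct, proxied) air date lists for the same URL."""
    return extract_airdates(fetch(url)), extract_airdates(fetch(url, proxy_url))
```

Calling compare_airdates(URL, proxy_url) with the address of a proxy you control in another region returns two lists. If they differ while the rest of the page matches, that supports the idea that the dates depend on the visitor's location rather than on the request headers.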