Web scraping an IMDb page using BeautifulSoup
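Problem description
I am new to web scraping, Python and BeautifulSoup, and I am having trouble getting my code to work.
I want to scrape the URL used in my code below to get:
- The celebrity's name
- The celebrity's image
- Profession
- Best work
The page lists ten celebrities. I don't know what I am doing wrong.
Here is my code: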
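import urllib2
from bs4 import BeautifulSoup

URL = 'http://m.imdb.com/feature/bornondate'
test_url = urllib2.urlopen(URL)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Used to keep track of the number of actors
count = 0
# Fetch the value present within the results tag
person = soup.findChildren('section', 'posters list')
# Turn person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Look for 'a' in iterperson. Every 'a' tag contains one person's information
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(',')
    ##bestWork = person[0].find('div', 'detail').contents[1].split(',')

    print '************ IMDB People Born Today ************'
    # Print the person's S.No
    print 'S.No. --> ',
    count += 1
    print count
    # Print the person's title/name
    print 'Title --> ' + title
    # Print the person's image source
    print 'Image Source --> ', imgSource
    # Print the person's profession
    ##print 'Profession --> ', profession
    # Print the person's best work
    ##print 'Best Work --> ', bestWork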
Currently nothing is getting printed out.
Also, if this is too vague, could you explain how to get, for instance, just the celebrity's name?
Here is the HTML for the first celebrity, in case it helps:
<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster"><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>
Solution
First of all, screen scraping is explicitly prohibited by IMDb's terms of use.
Try exploring the IMDb JSON API instead of a web-scraping approach.
Your current problem is that the list of people born on the specific date is loaded via a separate call to the IMDb
API, with JavaScript logic involved, so the HTML that urllib2 downloads does not contain the posters your code looks for.
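To make that diagnosis concrete, it may help to run the same kind of lookup against markup that already contains the posters. The sketch below is only an illustration, assuming Python 3 with the bs4 package installed. SNIPPET is a hand-cleaned and trimmed copy of the first celebrity's HTML from the question, and the function name extract_names is hypothetical. It also shows one way to get just the celebrity's name.

```python
from bs4 import BeautifulSoup

# Hand-cleaned, trimmed copy of the first poster from the question.
# Assumption: the live page may serve different markup today.
SNIPPET = """
<section class="posters list">
  <h1>March 7</h1>
  <a href="/name/nm0186505/" class="poster">
    <img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" width="40" height="59">
    <div class="label">
      <span class="title">Bryan Cranston</span>
      <div class="detail">Actor, "Ozymandias"</div>
    </div>
  </a>
</section>
"""


def extract_names(html):
    """Return the text of every span.title found inside an a.poster link."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True) for span in soup.select("a.poster span.title")]


# Markup that contains the posters: the name is found.
print(extract_names(SNIPPET))

# Markup without the posters section, standing in for what a plain HTTP
# fetch returns when the list is filled in later by JavaScript: nothing is found.
print(extract_names("<html><body><div id='main'></div></body></html>"))
```

When the section is missing from the downloaded HTML, the lookup simply returns an empty list, which is consistent with the original script printing nothing.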
The easiest option right now would be to switch to selenium
browser automation tool. Working example using a headless PhantomJS
browser:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text
    print img, person, title
Prints:
http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
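The question asked for the profession and the best work as separate fields, while the loop above prints them as one string. The sketch below shows one way to split them, plus a small wrapper around the image URL rewrite. It is written for Python 3 and uses only the standard library. The helper names resize_poster_url and split_detail are hypothetical, and the code assumes the detail text always looks like the printed output above (a profession, a comma, then a quoted title).

```python
def resize_poster_url(src, size_suffix="._V1_SX214_AL_.jpg"):
    """Replace everything from '._V1.' onward with a fixed size suffix.

    If the marker is absent, the URL is returned unchanged (a choice made
    for this sketch; the loop above appends the suffix unconditionally).
    """
    marker = "._V1."
    if marker not in src:
        return src
    return src.split(marker)[0] + size_suffix


def split_detail(detail):
    """Split a detail string such as 'Actor, "Ozymandias"' into two fields."""
    # Split on the first comma only, so titles containing commas stay intact.
    profession, _, best_work = detail.partition(",")
    return profession.strip(), best_work.strip().strip('"')


src = ("http://ia.media-imdb.com/images/M/"
       "MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@"
       "._V1._CR0,0,1369,2019_SX40_SY59.jpg")
print(resize_poster_url(src))
print(split_detail('Actor, "Ozymandias"'))
```

Inside the loop, these could be applied to the src attribute and to the div.detail text respectively.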