Web scraping an IMDb page using BeautifulSoup

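Problem description

I am new to web scraping, Python, and BeautifulSoup, and I am having trouble getting my code to work.

I want to scrape the page at the URL used in the code below to get: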


  • Celebrity name

  • Celebrity image

  • Profession

  • Best work

The page lists ten celebrities. I don't know what I am doing wrong.

Here is my code:

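import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Used to track the number of actors
count = 0
# Fetch the tag that holds the results
person = soup.findChildren('section', 'posters list')
# Turn person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Walk over each 'a' in iterperson. Every 'a' tag contains one person's information
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(',')
    ##bestWork = person[0].find('div', 'detail').contents[1].split(',')

    print '************ IMDB People Born Today ************'
    # Print the person's S.No
    print 'S.No. --> ',
    count += 1
    print count
    # Print the person's title/name
    print 'Title --> ' + title
    # Print the person's image source
    print 'Image Source --> ', imgSource
    # Print the person's profession
    ##print 'Profession --> ', profession
    # Print the person's best work
    ##print 'Best Work --> ', bestWork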

Currently nothing is printed.
Also, if this is too vague, could you explain how to get, for instance, just the celebrity's name?

Here is the HTML for the first celebrity, in case it helps:

<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster">
<img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59">
<div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div>
</a>


Solution

First of all, screen scraping is explicitly prohibited by IMDb's terms of use.

Try exploring the IMDb JSON API instead of a web-scraping approach.


Your current problem is that the list of people born on a specific date is loaded via a separate call to the IMDb API, with JavaScript logic involved, so the HTML that urllib2 downloads does not yet contain the poster entries your loop iterates over.
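
As for getting just the name: once the rendered markup is available, that part is a small parsing step. The following is a minimal sketch, assuming the rendered page contains markup shaped like the fragment quoted in the question. It uses Python 3's standard-library html.parser instead of BeautifulSoup so that it runs without extra packages; NameCollector and extract_names are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser

# Fragment modelled on the markup quoted in the question (one celebrity only).
SAMPLE_HTML = """
<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster">
<div class="label"><span class="title">Bryan Cranston</span>
<div class="detail">Actor, "Ozymandias"</div></div>
</a>
</section>
"""


class NameCollector(HTMLParser):
    """Collects the text of every <span> whose class list contains 'title'."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "title" in classes:
            self._in_title = True
            self.names.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a title span.
        if self._in_title:
            self.names[-1] += data

    def handle_endtag(self, tag):
        # Assumes title spans are not nested inside other spans.
        if tag == "span":
            self._in_title = False


def extract_names(html):
    parser = NameCollector()
    parser.feed(html)
    parser.close()
    return [name.strip() for name in parser.names]


names = extract_names(SAMPLE_HTML)
print(names)
```

With BeautifulSoup the same idea would be to look up the span with class title inside each a tag, as the question's loop already attempts; the parsing is not the obstacle, the missing markup is.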

The easiest option right now would be to switch to the selenium browser automation tool. Here is a working example using a headless PhantomJS browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

Prints:

http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
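
The image URL rewrite used in the loop above can be checked in isolation. The sketch below is an illustration rather than part of the original answer: the helper name to_sx214 is hypothetical, and reading the _SX214_AL_ suffix as a request for a larger rendition of the same image is an assumption. Only the split-and-append step itself comes from the code above.

```python
# Suffix appended by the answer's code; its exact meaning is an assumption here.
SUFFIX = "._V1_SX214_AL_.jpg"


def to_sx214(src):
    """Keep everything before the first '._V1.' marker, then append SUFFIX."""
    return src.split("._V1.")[0] + SUFFIX


# Thumbnail URL taken from the HTML fragment quoted in the question.
thumb = (
    "http://ia.media-imdb.com/images/M/"
    "MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@"
    "._V1._CR0,0,1369,2019_SX40_SY59.jpg"
)

print(to_sx214(thumb))
```

For this input the result matches the first URL in the printed output above. Note that if a src does not contain the '._V1.' marker, split returns the whole string and the suffix is appended to the unmodified URL, so a stricter version might check for the marker first.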

