Unable to scrape this movie website using BeautifulSoup

Problem description

I am trying to scrape a movie website here: http://www.21cineplex.com/nowplaying

I have uploaded a screenshot of the HTML body as the image in this question (link to screenshot here). I am having difficulty grabbing the movie title and the description, which is inside the <p> tag. For some strange reason, the description is not part of the response returned by requests. Also, when I use soup to find the ul by its class name, it cannot be found. Does anyone know why? I am using Python 3. This is my code so far:

    import requests
    import bs4

    r = requests.get('http://www.21cineplex.com/nowplaying')
    r.text  # no description here
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    soup.find('ul', class_='w462')  # why is this empty?

Recommended answer

This server checks the Referer header. If the request has no Referer, it sends the main page instead. It does not check the header's value, though, so any text works, even an empty string.

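    import requests
    import bs4

    headers = {
        # 'Referer': any URL (or even random text, or an empty string)

        # 'Referer': 'http://google.com',
        # 'Referer': 'http://www.21cineplex.com',
        # 'Referer': 'hello world!',
        'Referer': '',
    }

    s = requests.get('http://www.21cineplex.com/nowplaying', headers=headers)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = bs4.BeautifulSoup(s.text, 'html.parser')

    # Full text of every movie block, found by tag name and class
    for x in soup.find_all('ul', class_='w462'):
        print(x.text)

    # The same blocks, selected with a CSS selector
    for x in soup.select('ul.w462'):
        print(x.text)

    # Only the title (first <a>) and the description (first <p>) of each block
    for x in soup.select('ul.w462'):
        print(x.select('a')[0].text)
        print(x.select('p')[0].text)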
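
The last loop above raises an IndexError if a block has no <a> or <p> tag. A minimal sketch of a more tolerant version follows. The helper names `fetch_html` and `extract_movies` and the `SAMPLE_HTML` snippet are assumptions made up for illustration. Only the `ul.w462` selector and the <a>/<p> children come from the answer, and the real page markup may differ.

```python
import bs4
import requests

# Hypothetical sample markup: only the `ul.w462` class and the <a>/<p>
# children come from the answer; everything else is invented for the demo.
SAMPLE_HTML = """
<ul class="w462">
  <li><a href="/movie/1">First Movie</a><p>A short description.</p></li>
</ul>
<ul class="w462">
  <li><a href="/movie/2">Second Movie</a></li>
</ul>
"""


def fetch_html(url, referer=""):
    """Fetch a page while sending a Referer header (an empty string is allowed)."""
    response = requests.get(url, headers={"Referer": referer}, timeout=10)
    response.raise_for_status()
    return response.text


def extract_movies(html):
    """Return one (title, description) pair per `ul.w462` block.

    A missing <a> or <p> yields None instead of raising IndexError.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    movies = []
    for block in soup.select("ul.w462"):
        link = block.select_one("a")
        paragraph = block.select_one("p")
        title = link.get_text(strip=True) if link else None
        description = paragraph.get_text(strip=True) if paragraph else None
        movies.append((title, description))
    return movies


# Demonstrated on the sample markup so the example runs without network access.
movies = extract_movies(SAMPLE_HTML)
for title, description in movies:
    print(title, "-", description)
```

Against the live site, the same helper could presumably be called as `extract_movies(fetch_html('http://www.21cineplex.com/nowplaying'))`, provided the server still behaves as described in the answer.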
