Scraping a table with BeautifulSoup

Question

In this first snippet, I can use BeautifulSoup to get all the info within the table of interest:

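from urllib.request import urlopen  # Python 3; Python 2 used `from urllib import urlopen`
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# Iterate over the direct children of the giftList table
for child in soup.find("table", {"id": "giftList"}).children:
    print(child)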

This prints the list of products.

I want to print the rows in the tournamentTable here (the desired info is in class=deactivate and class=odd deactivate, and the date is in class=center nob-border):

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.oddsportal.com/hockey/russia/khl/results/#/page/2.html")
soup = BeautifulSoup(html, "html.parser")

# Looking the table up by id finds nothing, so this attempt fails:
# for i in soup.find("table", {"id": "tournamentTable"}).children:
#     print(i)
for i in soup.find("table", {"class": "table-main"}).children:
    print(i)

But that prints other tables on the page. When I try to specify the table of interest with {"id":"tournamentTable"}, it returns None (NoneType).

What am I missing that keeps me from accessing the desired table and the information within it?

Recommended answer

urlopen only downloads the HTML that the server sends for a URL. It does not run JavaScript, so you get the page as it looks with JavaScript turned off. In your case, this means that when urllib loads the relevant URL, the table with id="tournamentTable" never actually loads.

You can observe this behaviour by turning off JavaScript in your browser and loading the URL.
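The same effect can be reproduced offline. The sketch below is only an illustration: STATIC_HTML is a made-up stand-in for the markup that urlopen might receive (it is not the real page), and find_table is a hypothetical helper. It shows that find returns None when the table is absent from the static HTML, and how to guard against that before touching .children.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML received without JavaScript:
# the table-main table is present, tournamentTable is not.
STATIC_HTML = """
<html><body>
  <table class="table-main"><tr><td>some other table</td></tr></table>
  <script>/* JavaScript would build table#tournamentTable here */</script>
</body></html>
"""


def find_table(html, table_id):
    """Return the <table> with the given id, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", {"id": table_id})


table = find_table(STATIC_HTML, "tournamentTable")
if table is None:
    # Calling .children on None is what raises the NoneType error.
    print("tournamentTable is not in the static HTML")
else:
    for child in table.children:
        print(child)
```

Checking for None first turns the confusing NoneType error into an explicit message that the element is missing from the downloaded markup.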

To scrape a webpage whose content is rendered by JavaScript, consider using a browser automation package such as Selenium. If you scrape regularly, you might also want to install a 'JavaScript switcher' browser plugin, which lets you toggle JavaScript on and off with ease.
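A minimal sketch of the Selenium approach follows. It rests on several assumptions: Selenium 4 with a local Chrome install, the table still using id="tournamentTable" once JavaScript has run, and a 10-second wait being long enough. fetch_rendered_html and table_rows are hypothetical helper names, not part of either library.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, table_id="tournamentTable", timeout=10):
    """Load url in headless Chrome and return the HTML after JavaScript ran."""
    # Imported here so the parsing helper below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the table appears in the DOM within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, table_id))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on a timeout


def table_rows(html, table_id="tournamentTable"):
    """Return the text of each row in the table, or [] if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        return []
    return [row.get_text(" ", strip=True) for row in table.find_all("tr")]
```

To use it, pass the page URL from the question to fetch_rendered_html and hand the result to table_rows. Keeping the fetch and the parsing separate means the BeautifulSoup part can be checked against saved HTML without launching a browser.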
