This article describes how to scrape data from a flexbox element/container with Python and Beautiful Soup. It should be a useful reference for anyone facing the same problem.

Problem description

I am trying to scrape data from a utility website using Python, Beautiful Soup, and Selenium. The data I am trying to scrape includes fields such as time, cause, and status. When I run a typical page request, parse the page, search for the data I am looking for (the content inside id="OutageListTable"), and print it, the divs and strings are nowhere to be found. When I inspect the page element, the data is there, but it is inside a flex container.

This is the code that I am using:

from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.pse.com/outage/outage-map'

# Load the page in a real browser so any JavaScript can run
browser = webdriver.Firefox()
browser.get(my_url)

# Parse whatever HTML the browser has at this moment
html = browser.page_source
page_soup = soup(html, features='lxml')

# Look for the outage list by its id
outage_list = page_soup.find(id='OutageListTable')
print(outage_list)  # prints None here - the element was not found

browser.quit()

How do you retrieve information that is in a flex/flexbox container? I am not finding any resources online to help me figure it out.
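
For context on why the first attempt fails: browser.page_source is captured as soon as get() returns, but this page builds the outage list with JavaScript afterwards, so the parser sees an empty shell. Below is a minimal sketch of adding an explicit wait before parsing, reusing the Firefox setup from the question (whether the waited-for id actually appears in the rendered markup is an assumption; the answer below targets a class instead):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get('https://www.pse.com/outage/outage-map')

# Block (up to 10 s) until the element has been added to the DOM by
# JavaScript, then take the now-rendered page source.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'OutageListTable')))
print(BeautifulSoup(browser.page_source, 'lxml').find(id='OutageListTable'))
browser.quit()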

Recommended answer

You are overthinking the problem. First, there is no "flexbox container" problem here; it is a simple case of targeting the right div class. You should be looking at the div with class="col-xs-12 col-sm-6 col-md-4 listView-container".

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Create object for Chrome options
chrome_options = Options()
base_url = 'https://www.pse.com/outage/outage-map'

chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# Disable the message "Chrome is being controlled by automated test software"
chrome_options.add_argument('--disable-infobars')
# Pass 1 to allow notifications and 2 to block them
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
    })
# Invoke the webdriver
browser = webdriver.Chrome(executable_path=r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options=chrome_options)
browser.get(base_url)
delay = 5  # seconds

try:
    # Wait until the JavaScript-rendered outage cards are present in the DOM
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'listView-container')))
    print("Page is ready")
    html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup = BeautifulSoup(html, "html.parser")
    for item_n in soup.find_all('div', class_='col-xs-12 col-sm-6 col-md-4 listView-container'):
        for item_n_text in item_n.find_all(name="span"):
            print(item_n_text.text)
except TimeoutException:
    print("Loading took too much time! Try again.")
finally:
    # Close the automated browser
    browser.quit()
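
Note that the wait is what actually fixes the problem: WebDriverWait(...).until(...) blocks until the rendered cards exist, whereas constructing WebDriverWait(browser, delay) on its own waits for nothing. Once the DOM is complete, the execute_script(...innerHTML) call and browser.page_source both return the rendered markup. Sample output: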

Cause:
Accident
Status:
Crew assigned
Last updated:
06/02 11:00 PM
9. Woodinville
Start time:
06/02 08:29 PM
Est. restoration time:
06/03 03:30 AM
Customers impacted:
2
Cause:
Under Investigation
Status:
Crew assigned
Last updated:
06/03 12:15 AM
1. Bellingham
Start time:
06/02 06:09 PM
Est. restoration time:
06/03 06:30 AM
Customers impacted:
1
Cause:
Trees/Vegetation
Status:
Crew assigned
Last updated:
06/02 11:50 PM
2. Deming
Start time:
06/02 07:10 PM
Est. restoration time:
06/03 03:30 AM
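
Since the cards are identified purely by class, a CSS selector reaches the same spans. The grouping below into one dict per card is a hypothetical post-processing step on the same soup object, assuming (as the sample output suggests) that label spans end with a colon and are immediately followed by their value:

# Hypothetical post-processing: build one dict per outage card.
# Assumes label spans end with ':' and the value span follows directly.
outages = []
for card in soup.select('div.listView-container'):
    texts = [s.get_text(strip=True) for s in card.find_all('span')]
    record = {}
    for i, t in enumerate(texts):
        if t.endswith(':') and i + 1 < len(texts):
            record[t.rstrip(':')] = texts[i + 1]
    outages.append(record)
print(outages)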

That concludes this article on how to scrape data from a flexbox element/container with Python and Beautiful Soup. Hopefully the recommended answer above is helpful.
