Problem Description
I am trying to scrape data from a utility website using Python, Beautiful Soup, and Selenium. The data I am trying to scrape includes fields like time, cause, and status. When I run a typical page request, parse the page, then look for the data I want (the contents of id="OutageListTable") and print it, the divs and strings are nowhere to be found. When I inspect the page element, the data is there, but it is inside a flex container.
This is the code that I am using:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import urllib3
from selenium import webdriver
my_url = 'https://www.pse.com/outage/outage-map'
browser = webdriver.Firefox()
browser.get(my_url)
html = browser.page_source
page_soup = soup(html, features='lxml')
outage_list = page_soup.find(id='OutageListTable')
print(outage_list)
browser.quit()
How do you retrieve information that is in a flex/flexbox container? I am not finding any resources online to help me figure it out.
Recommended Answer
You are overthinking the problem. First, there is no flexbox container to worry about. It's a simple case of targeting the right div class. You should be looking at the div with
class_='col-xs-12 col-sm-6 col-md-4 listView-container'
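As a minimal, self-contained illustration of that selector (using hypothetical HTML that mirrors the outage list's structure, not the live page): when `class_` is given a space-separated string, BeautifulSoup matches tags whose class attribute is exactly that string.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mirroring the structure of one outage entry
html = """
<div class="col-xs-12 col-sm-6 col-md-4 listView-container">
  <span>Cause:</span><span>Accident</span>
</div>
<div class="other-widget">
  <span>ignored</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string matches the exact value of the class attribute
matches = soup.find_all('div', class_='col-xs-12 col-sm-6 col-md-4 listView-container')
print(len(matches))                                    # 1
print([s.text for s in matches[0].find_all('span')])   # ['Cause:', 'Accident']
```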
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from time import sleep

# Create an object for Chrome options
chrome_options = Options()
base_url = 'https://www.pse.com/outage/outage-map'
chrome_options.add_argument('--disable-notifications')
chrome_options.add_argument('--start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message "Chrome is being controlled by automated test software"
chrome_options.add_argument('--disable-infobars')
# Pass the argument 1 to allow notifications and 2 to block them
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})

# Invoke the webdriver
browser = webdriver.Chrome(executable_path=r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options=chrome_options)
browser.get(base_url)

delay = 5  # seconds
# Poll the page repeatedly; stop the script with Ctrl+C
while True:
    try:
        # Wait until the JavaScript-rendered outage list is present in the DOM
        WebDriverWait(browser, delay).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'listView-container')))
        print("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        soup = BeautifulSoup(html, "html.parser")
        for item_n in soup.find_all('div', class_='col-xs-12 col-sm-6 col-md-4 listView-container'):
            for item_n_text in item_n.find_all(name="span"):
                print(item_n_text.text)
    except TimeoutException:
        print("Loading took too much time! - Try again")

# Close the automated browser
browser.close()
Sample output (the loop re-scrapes the page on each pass, so the list repeats):

Cause:
Accident
Status:
Crew assigned
Last updated:
06/02 11:00 PM
9. Woodinville
Start time:
06/02 08:29 PM
Est. restoration time:
06/03 03:30 AM
Customers impacted:
2
Cause:
Under Investigation
Status:
Crew assigned
Last updated:
06/03 12:15 AM
Page is ready
1. Bellingham
Start time:
06/02 06:09 PM
Est. restoration time:
06/03 06:30 AM
Customers impacted:
1
Cause:
Trees/Vegetation
Status:
Crew assigned
Last updated:
06/02 11:50 PM
2. Deming
Start time:
06/02 07:10 PM
Est. restoration time:
06/03 03:30 AM
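The spans print as a flat stream of alternating label/value lines. If you want structured records instead, a minimal sketch (using sample lines from the output above; the `to_record` helper is mine, not part of the original answer) pairs each "Label:" line with the value that follows it:

```python
# Flat span texts for one outage, as printed by the scraper above
lines = [
    "1. Bellingham",
    "Start time:", "06/02 06:09 PM",
    "Est. restoration time:", "06/03 06:30 AM",
    "Customers impacted:", "1",
    "Cause:", "Trees/Vegetation",
    "Status:", "Crew assigned",
    "Last updated:", "06/02 11:50 PM",
]

def to_record(lines):
    """Turn one outage's span texts into a dict: the first line is the
    location; each following 'Label:' line is paired with the next value."""
    record = {"location": lines[0]}
    for label, value in zip(lines[1::2], lines[2::2]):
        record[label.rstrip(':')] = value
    return record

record = to_record(lines)
print(record["location"])   # 1. Bellingham
print(record["Cause"])      # Trees/Vegetation
```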