Problem description
I am trying to retrieve the comment section on regulations.gov pages. An example is the paragraph "Restrictions on Proprietary Trading... with free market driven valuations." on http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032.
I am using BeautifulSoup and Python and have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")
source = driver.page_source.encode('ascii', 'replace')
soup = BeautifulSoup(source)
print soup
commentHolder = soup.find("div", {"class":"GGAAYMKDDNE"})
print commentHolder
When I execute "print soup" I get an output (albeit a messy one), but when I execute "print commentHolder" I get "None" as the output. I am not quite sure why this is happening and would appreciate any help. Thank you.
Note: I used Selenium webdriver to try and get around the Javascript - is this a correct approach?
Recommended answer
You need to let PhantomJS explicitly wait for the element to become present before reading the page_source. Worked for me:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")
# Wait up to 10 seconds for the comment container to be added to the DOM
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.GGAAYMKDGNE")))
# Only after the wait succeeds should driver.page_source be read and parsed
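To make it clearer why the wait helps: an explicit wait is essentially a polling loop that keeps re-checking a condition until it succeeds or a timeout expires. The sketch below shows that idea in plain Python, without Selenium. The names `wait_until`, `WaitTimeout`, and `FakePage` are hypothetical and only illustrate the mechanism; this is not Selenium's actual implementation.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose content only appears after a few polls."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def find_comment(self):
        # Returns None until the "JavaScript" has finished rendering.
        self.calls += 1
        if self.calls >= self.ready_after:
            return "<div>comment text</div>"
        return None


page = FakePage(ready_after=3)
element = wait_until(page.find_comment, timeout=2.0, poll_interval=0.01)
print(element)
```

Reading `page_source` right after `driver.get(...)` corresponds to calling `find_comment()` only once: the lookup runs before the content exists, so the result is `None`.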