Question
Here is the link I want to scrape: http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=MMFU_U
The "English Version" tab is at the upper right hand corner in order to show the English version of the web page.
There is a button I have to press in order to read the fund information on the web page. If it isn't pressed, the view is blocked, and scrapy shell always returns an empty [].
<div onclick="AgreeClick()" style="width:200px; padding:8px; border:1px black solid;
background-color:#cccccc; cursor:pointer;">Confirmed</div>
And the function of AgreeClick is:
function AgreeClick() {
    var cookieKey = "ListFundShowDisclaimer";
    SetCookie(cookieKey, "true", null);
    Get("disclaimerDiv").style.display = "none";
    Get("blankDiv").style.display = "none";
    Get("screenDiv").style.display = "none";
    //Get("contentTable").style.display = "block";
    ShowDropDown();
}
How do I overcome this onclick="AgreeClick()" function to scrape the web page?
Answer
You cannot just click the link inside scrapy (see Click a Button in Scrapy).
First of all, check whether the data you need is already there in the HTML: the disclaimer only hides it from view, so it may still be present in the downloaded page.
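A quick way to convince yourself of this, sketched over a hypothetical stand-in for the page's markup (the real ids and values will differ): an element hidden with display:none is still present in the HTML, so an XPath query can still reach it.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page: the fund table is hidden, not absent.
html = '<div id="contentTable" style="display:none"><table><tr><td>1.2345</td></tr></table></div>'

root = ET.fromstring(html)
# The hidden cell is still reachable: CSS affects rendering, not the markup itself.
print(root.find('.//td').text)  # -> 1.2345
```

If a check like this finds the numbers in the raw response, no clicking is needed at all.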
Another option is selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Firefox()
browser.get("http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=MMFU_U")

# Click the "Confirmed" div so the disclaimer overlay goes away
elem = browser.find_element(By.XPATH, '//*[@id="disclaimer"]/div/div')
elem.click()
time.sleep(0.2)

# Dump the whole rendered page so it can be parsed elsewhere
elem = browser.find_element(By.XPATH, "//*")
print(elem.get_attribute("outerHTML"))
One more option is to use mechanize. It cannot execute js code, but, according to the source code, AgreeClick just sets the cookie ListFundShowDisclaimer to true. This is a starting point (not sure if it works):
import http.cookiejar as cookielib  # 'cookielib' was renamed in Python 3
import mechanize

br = mechanize.Browser()
cj = cookielib.CookieJar()

# Pre-set the cookie that AgreeClick() would have written in the browser
ck = cookielib.Cookie(version=0, name='ListFundShowDisclaimer', value='true', port=None, port_specified=False,
                      domain='www.prudential.com.hk', domain_specified=False, domain_initial_dot=False, path='/',
                      path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None,
                      rest={'HttpOnly': None}, rfc2109=False)
cj.set_cookie(ck)
br.set_cookiejar(cj)

br.open("http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=MMFU_U")
print(br.response().read())
Then, you can parse the result with BeautifulSoup or whatever you prefer.
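For instance, a minimal BeautifulSoup sketch over a stand-in snippet (the live page's table layout will differ):

```python
from bs4 import BeautifulSoup

# Stand-in for br.response().read(); the real markup will be more complex.
html = "<table><tr><td>Fund</td><td>1.2345</td></tr></table>"

soup = BeautifulSoup(html, "html.parser")
# Collect the text of every table cell
print([td.get_text() for td in soup.find_all("td")])  # -> ['Fund', '1.2345']
```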