I'm trying to use a Scrapy spider to crawl links from a listing page through to the product pages. The page shows the first 10 machines and has a "Show all machines" button that calls some JavaScript. The JavaScript is fairly involved (i.e., I can't just read the function and see what URL the button points to). I'm trying to use Selenium WebDriver to simulate a click on the button, but for some reason it isn't working: when I scrape the product links I only get the first 10, not the full list.
Can anyone tell me why it isn't working?
The page I'm trying to scrape is http://www.ncservice.com/en/second-hand-milling-machines
The spider is:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request, FormRequest
from scrapy import log
from scrapy.exceptions import DropItem
from scrapy import signals
from mtispider.items import MachineItem
import urlparse
import time
import MySQLdb
import unicodedata
import re
from mtispider import tools
from selenium import webdriver
class MachineSpider(CrawlSpider):
    name = 'nc-spider'
    allowed_domains = ['ncservice.com']

    def start_requests(self):
        requests = list(super(MachineSpider, self).start_requests())
        requests.append(Request('http://www.ncservice.com/en/second-hand-milling-machines',
                                callback=self.parsencmilllist))
        return requests

    def parsencmilllist(self, response):
        hxs = HtmlXPathSelector(response)
        driver = webdriver.Firefox()
        driver.get(response.url)
        try:
            driver.FindElement(By.Id("mas-resultados-fresadoras")).Click()
        except:
            log.msg("Couldnt get all the machines", level=log.INFO)
        ncmachs = hxs.select('//div[@id="resultados"]//a/@href').extract()
        for ncmach in ncmachs:
            yield Request(ncmach,
                          meta={'type': 'Milling'},
                          callback=self.parsencmachine)
        driver.quit()

    def parsencmachine(self, response):
        #scrape the machine
        return item
Thanks!
Best Answer
The main problem is that you need to initialize the Selector from the webdriver's page_source, not from the response passed into the callback:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy import Selector
from selenium import webdriver
class MachineSpider(CrawlSpider):
    name = 'nc-spider'
    allowed_domains = ['ncservice.com']

    def start_requests(self):
        yield Request('http://www.ncservice.com/en/second-hand-milling-machines',
                      callback=self.parsencmilllist)

    def parsencmilllist(self, response):
        driver = webdriver.Firefox()
        driver.get(response.url)
        driver.find_element_by_id("mas-resultados-fresadoras").click()

        sel = Selector(text=driver.page_source)
        driver.quit()

        links = sel.xpath('//div[@id="resultados"]//a/@href').extract()
        for link in links:
            yield Request(link,
                          meta={'type': 'Milling'},
                          callback=self.parsencmachine)

    def parsencmachine(self, response):
        print response.url
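One thing to watch: if the button triggers an AJAX request, page_source may be read before the extra machines have finished loading. Here is a minimal sketch of an explicit wait you could add before building the Selector; the "> 10" threshold and the 10-second timeout are assumptions based on the question's description of an initial list of 10 results:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://www.ncservice.com/en/second-hand-milling-machines')
driver.find_element_by_id("mas-resultados-fresadoras").click()

# Block until the results container holds more links than the initial page
# showed (assumes 10 initial results; adjust the threshold to the actual page)
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements_by_css_selector('#resultados a')) > 10)

html = driver.page_source
driver.quit()

An explicit wait like this is generally more reliable than a fixed time.sleep(), since it returns as soon as the condition is met and raises a TimeoutException if it never is.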
For "javascript - Selenium Click() not working with Scrapy spider", see the similar question on Stack Overflow: https://stackoverflow.com/questions/28742906/