I'm scraping the Madrid Assembly's website, built in aspx, and I have no idea how to simulate clicks on the links in order to get the corresponding politicians. I tried this:
import scrapy

class AsambleaMadrid(scrapy.Spider):
    name = "Asamblea_Madrid"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):
        for id in response.css('div#moduloBusqueda div.sangria div.sangria ul li a::attr(id)'):
            target = id.extract()
            url = "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"
            # hidden ASP.NET fields below were copied verbatim from one browser session
            formdata = {'__EVENTTARGET': target,
'__VIEWSTATE': '/wEPDwUBMA9kFgJmD2QWAgIBD2QWBAIBD2QWAgIGD2QWAmYPZBYCAgMPZBYCAgMPFgIeE1ByZXZpb3VzQ29udHJvbE1vZGULKYgBTWljcm9zb2Z0LlNoYXJlUG9pbnQuV2ViQ29udHJvbHMuU1BDb250cm9sTW9kZSwgTWljcm9zb2Z0LlNoYXJlUG9pbnQsIFZlcnNpb249MTQuMC4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49NzFlOWJjZTExMWU5NDI5YwFkAgMPZBYMAgMPZBYGBSZnXzM2ZWEwMzEwXzg5M2RfNGExOV85ZWQxXzg4YTEzM2QwNjQyMw9kFgJmD2QWAgIBDxYCHgtfIUl0ZW1Db3VudAIEFghmD2QWAgIBDw8WBB4PQ29tbWFuZEFyZ3VtZW50BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkHgRUZXh0BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkZGQCAQ9kFgICAQ8PFgQfAgUeR3J1cG8gUGFybGFtZW50YXJpbyBTb2NpYWxpc3RhHwMFHkdydXBvIFBhcmxhbWVudGFyaW8gU29jaWFsaXN0YWRkAgIPZBYCAgEPDxYEHwIFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkHwMFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkZGQCAw9kFgICAQ8PFgQfAgUhR3J1cG8gUGFybGFtZW50YXJpbyBkZSBDaXVkYWRhbm9zHwMFIUdydXBvIFBhcmxhbWVudGFyaW8gZGUgQ2l1ZGFkYW5vc2RkBSZnX2MxNTFkMGIxXzY2YWZfNDhjY185MWM3X2JlOGUxMTZkN2Q1Mg9kFgRmDxYCHgdWaXNpYmxlaGQCAQ8WAh8EaGQFJmdfZTBmYWViMTVfOGI3Nl80MjgyX2ExYjFfNTI3ZDIwNjk1ODY2D2QWBGYPFgIfBGhkAgEPFgIfBGhkAhEPZBYCAgEPZBYEZg9kFgICAQ8WAh8EaBYCZg9kFgQCAg9kFgQCAQ8WAh8EaGQCAw8WCB4TQ2xpZW50T25DbGlja1NjcmlwdAW7AWphdmFTY3JpcHQ6Q29yZUludm9rZSgnVGFrZU9mZmxpbmVUb0NsaWVudFJlYWwnLDEsIDEsICdodHRwOlx1MDAyZlx1MDAyZnd3dy5hc2FtYmxlYW1hZHJpZC5lc1x1MDAyZkVTXHUwMDJmUXVlRXNMYUFzYW1ibGVhXHUwMDJmQ29tcG9zaWNpb25kZWxhQXNhbWJsZWFcdTAwMmZMb3NEaXB1dGFkb3MnLCAtMSwgLTEsICcnLCAnJykeGENsaWVudE9uQ2xpY2tOYXZpZ2F0ZVVybGQeKENsaWVudE9uQ2xpY2tTY3JpcHRDb250YWluaW5nUHJlZml4ZWRVcmxkHgxIaWRkZW5TY3JpcHQFIVRha2VPZmZsaW5lRGlzYWJsZWQoMSwgMSwgLTEsIC0xKWQCAw8PFgoeCUFjY2Vzc0tleQUBLx4PQXJyb3dJbWFnZVdpZHRoAgUeEEFycm93SW1hZ2VIZWlnaHQCAx4RQXJyb3dJbWFnZU9mZnNldFhmHhFBcnJvd0ltYWdlT2Zmc2V0WQLrA2RkAgEPZBYCAgUPZBYCAgEPEBYCHwRoZBQrAQBkAhcPZBYIZg8PFgQfAwUPRW5nbGlzaCBWZXJzaW9uHgtOYXZpZ2F0ZVVybAVfL0VOL1F1ZUVzTGFBc2FtYmxlYS9Db21wb3NpY2lvbmRlbGFBc2FtYmxlYS9Mb3NEaXB1dGFkb3MvUGFnZXMvUmVsYWNpb25BbGZhYmV0aWNhRGlwdXRhZG9zLmFzcHhkZAICDw8WBB8DBQZQcmVuc2EfDgUyL0VTL0JpZW52ZW5pZGFQcmVuc2EvUGFnaW5hcy9CaWVudmVuaWRhUHJlbnNhLmFzcHhkZAIEDw8WBB8DBRpJZGVudGlmaWNhY2nDs24gZGUgVXN1YXJpbx8OBTQvRVMvQXJlYVVzdWFyaW9zL1BhZ2luYXMvSWRlbnRpZmljYWNpb25Vc3Vhcmlvcy5hc3B4ZGQCBg8PFgQfAwUGQ29ycmVvHw4FKGh0dHA6Ly9vdXRsb29rLmNvbS9vd2EvYXNhbWJsZWFtYWRyaWQuZXNkZAIlD2QWAgIDD2QWAgIBDxYCHwALKwQBZAI1D2QWAgIHD2QWAgIBDw8WAh8EaGQWAgIDD2QWAmYPZBYCAgMPZBYCAgUPDxYEHgZIZWlnaHQbAAAAAAAAeUABAAAAHgRfIVNCAoABZBYCAgEPPCsACQEADxYEHg1QYXRoU2VwYXJhdG9yBAgeDU5ldmVyRXhwYW5kZWRnZGQCSQ9kFgICAg9kFgICAQ9kFgICAw8WAh8ACysEAWQYAgVBY3RsMDAkUGxhY2VIb2xkZXJMZWZ0TmF2QmFyJFVJVmVyc2lvbmVkQ29udGVudDMkVjRRdWlja0xhdW5jaE1lbnUPD2QFKUNvbXBvc2ljacOzbiBkZSBsYSBBc2FtYmxlYVxMb3MgRGlwdXRhZG9zZAVHY3RsMDAkUGxhY2VIb2xkZXJUb3BOYXZCYXIkUGxhY2VIb2xkZXJIb3Jpem9udGFsTmF2JFRvcE5hdmlnYXRpb25NZW51VjQPD2QFGkluaWNpb1xRdcOpIGVzIGxhIEFzYW1ibGVhZJ',
'__EVENTVALIDATION': '/wEWCALIhqvYAwKh2YVvAuDF1KUDAqCK1bUOAqCKybkPAqCKnbQCAqCKsZEJAvejv84Dtkx5dCFr3QGqQD2wsFQh8nP3iq8',
'__VIEWSTATEGENERATOR': 'BAB98CB3',
'__REQUESTDIGEST': '0x476239970DCFDABDBBDF638A1F9B026BD43022A10D1D757B05F1071FF3104459B4666F96A47B4845D625BCB2BE0D88C6E150945E8F5D82C189B56A0DA4BC859D'}
            yield scrapy.FormRequest(url=url, formdata=formdata, callback=self.takeEachParty)

    def takeEachParty(self, response):
        print(response.css('ul.listadoVert02 ul li::text').extract())
Going into the source code of the website, I can see what the links look like and how they send the JavaScript query. This is one of the links I need to access:
<a id="ctl00_m_g_36ea0310_893d_4a19_9ed1_88a133d06423_ctl00_Repeater1_ctl00_lnk_Grupo" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl00$lnk_Grupo", "", true, "", "", false, true))">Grupo Parlamentario Popular de la Asamblea de Madrid</a>
I have been reading many articles about this, but the problem is probably my own ignorance of the subject.
Thanks in advance.
EDITED:
SOLUTION: I finally did it! I translated the very helpful code from Padraic Cunningham into the Scrapy way. As I asked the question specifically about Scrapy, I want to post the result just in case someone has the same problem I had.
So here it goes:
import scrapy
import js2xml

class AsambleaMadrid(scrapy.Spider):
    name = "AsambleaMadrid"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):
        source = response
        hrefs = response.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href").extract()
        form_data = self.validate(source)
        for ref in hrefs:
            # js2xml allows us to parse the JS function and params, and so to grab the __EVENTTARGET
            js_xml = js2xml.parse(ref)
            _id = js_xml.xpath(
                "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
            form_data["__EVENTTARGET"] = _id.text
            url_diputado = 'http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx'
            # the proper way to send a POST in Scrapy is by using FormRequest
            yield scrapy.FormRequest(url=url_diputado, formdata=form_data, callback=self.extract_parties, method='POST')

    def validate(self, source):
        # these fields are the minimum required, as they cannot be hardcoded
        data = {"__VIEWSTATEGENERATOR": source.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0].extract(),
                "__EVENTVALIDATION": source.xpath("//*[@id='__EVENTVALIDATION']/@value")[0].extract(),
                "__VIEWSTATE": source.xpath("//*[@id='__VIEWSTATE']/@value")[0].extract(),
                "__REQUESTDIGEST": source.xpath("//*[@id='__REQUESTDIGEST']/@value")[0].extract()}
        return data

    def extract_parties(self, response):
        source = response
        name = source.xpath("//ul[@class='listadoVert02']/ul/li/a/text()").extract()
        print(name)
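A side note for anyone adapting this: Scrapy also has scrapy.FormRequest.from_response, which reads the page's form and pre-populates hidden inputs such as __VIEWSTATE and __EVENTVALIDATION automatically, so the hand-written validate() may not be needed. This is only a sketch, untested against this site (it assumes from_response picks up the main ASP.NET form, which is the first form on the page); parse() above would reduce to:

    def parse(self, response):
        for ref in response.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href").extract():
            target = js2xml.parse(ref).xpath(
                "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0].text
            # from_response copies every hidden <input> of the page's form into
            # the request body, so only __EVENTTARGET has to be set by hand
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'__EVENTTARGET': target},
                callback=self.extract_parties)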
I hope it's clear. Thanks again, everybody!
ANSWER:

If you look at the data posted to the form in Chrome or Firebug, you can see that many fields are passed in the POST request. A few of them are essential and must be parsed from the original page. Parsing the ids from the div.sangria ul li a tags is not sufficient, as the actual data posted is slightly different: what gets posted is built in the JavaScript function WebForm_DoPostBackWithOptions, which is in the href, not the id attribute:
href='javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions("ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo", "", true, "", "", false, true))'>
Sometimes all the underscores are just replaced with dollar signs, so it is easy to do a str.replace to get the target into the correct form, but not in this case. We could use a regex to parse it, but I like the js2xml lib, which can parse a JavaScript function and its args into an XML tree.
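To make that concrete, here is a small standalone sketch of what js2xml gives us for the href quoted above; the node names (identifier, arguments, string) are exactly what the XPath in the code below relies on:

import js2xml

# the href from above, with the HTML entities already unescaped
href = ('javascript:WebForm_DoPostBackWithOptions(new '
        'WebForm_PostBackOptions("ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo", '
        '"", true, "", "", false, true))')
tree = js2xml.parse(href)
# the tree contains <identifier name="WebForm_PostBackOptions"> followed by an
# <arguments> node whose first <string> child is the postback target
print(js2xml.pretty_print(tree))
target = tree.xpath("//identifier[@name='WebForm_PostBackOptions']"
                    "/following-sibling::arguments/string[starts-with(.,'ctl')]")[0].text
print(target)  # -> ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo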
The following code using requests shows you how you can get the data from the initial request and get to all the pages you want:
import requests
from lxml import html
import js2xml

post = "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"

def validate(xml):
    # these fields are the minimum required, as they cannot be hardcoded
    data = {"__VIEWSTATEGENERATOR": xml.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0],
            "__EVENTVALIDATION": xml.xpath("//*[@id='__EVENTVALIDATION']/@value")[0],
            "__VIEWSTATE": xml.xpath("//*[@id='__VIEWSTATE']/@value")[0],
            "__REQUESTDIGEST": xml.xpath("//*[@id='__REQUESTDIGEST']/@value")[0]}
    return data

with requests.Session() as s:
    # make the initial request to get the links/hrefs and the form fields
    r = s.get(
        "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx")
    xml = html.fromstring(r.content)
    hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
    form_data = validate(xml)
    for h in hrefs:
        js_xml = js2xml.parse(h)
        _id = js_xml.xpath(
            "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
        form_data["__EVENTTARGET"] = _id.text
        r = s.post(post, data=form_data)
        xml = html.fromstring(r.content)
        print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))
If we run the code above we see the different text output from all the anchor tags (the \xe9-style escapes are just Python 2's repr of non-ASCII characters):
In [2]: with requests.Session() as s:
...: r = s.get(
...: "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx")
...: xml = html.fromstring(r.content)
...: hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
...: form_data = validate(xml)
...: for h in hrefs:
...: js_xml = js2xml.parse(h)
...: _id = js_xml.xpath(
...: "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[
...: 0]
...: form_data["__EVENTTARGET"] = _id.text
...: r = s.post(post, data=form_data)
...: xml = html.fromstring(r.content)
...: print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))
...:
[u'Abo\xedn Abo\xedn, Sonsoles Trinidad', u'Adrados Gautier, M\xaa Paloma', u'Aguado Del Olmo, M\xaa Josefa', u'\xc1lvarez Padilla, M\xaa Nadia', u'Arribas Del Barrio, Jos\xe9 M\xaa', u'Ballar\xedn Valc\xe1rcel, \xc1lvaro C\xe9sar', u'Berrio Fern\xe1ndez-Caballero, M\xaa In\xe9s', u'Berzal Andrade, Jos\xe9 Manuel', u'Cam\xedns Mart\xednez, Ana', u'Carballedo Berlanga, M\xaa Eugenia', 'Cifuentes Cuencas, Cristina', u'D\xedaz Ayuso, Isabel Natividad', u'Escudero D\xedaz-Tejeiro, Marta', u'Fermosel D\xedaz, Jes\xfas', u'Fern\xe1ndez-Quejo Del Pozo, Jos\xe9 Luis', u'Garc\xeda De Vinuesa Gardoqui, Ignacio', u'Garc\xeda Mart\xedn, Mar\xeda Bego\xf1a', u'Garrido Garc\xeda, \xc1ngel', u'G\xf3mez Ruiz, Jes\xfas', u'G\xf3mez-Angulo Rodr\xedguez, Juan Antonio', u'Gonz\xe1lez Gonz\xe1lez, Isabel Gema', u'Gonz\xe1lez Jim\xe9nez, Bartolom\xe9', u'Gonz\xe1lez Taboada, Jaime', u'Gonz\xe1lez-Mo\xf1ux V\xe1zquez, Elena', u'Gonzalo L\xf3pez, Rosal\xeda', 'Izquierdo Torres, Carlos', u'Li\xe9bana Montijano, Pilar', u'Mari\xf1o Ortega, Ana Isabel', u'Moraga Valiente, \xc1lvaro', u'Mu\xf1oz Abrines, Pedro', u'N\xfa\xf1ez Guijarro, Jos\xe9 Enrique', u'Olmo Fl\xf3rez, Luis Del', u'Ongil Cores, M\xaa Gador', 'Ortiz Espejo, Daniel', u'Ossorio Crespo, Enrique Mat\xedas', 'Peral Guerra, Luis', u'P\xe9rez Baos, Ana Isabel', u'P\xe9rez Garc\xeda, David', u'Pla\xf1iol De Lacalle, Regina M\xaa', u'Redondo Alcaide, M\xaa Isabel', u'Roll\xe1n Ojeda, Pedro', u'S\xe1nchez Fern\xe1ndez, Alejandro', 'Sanjuanbenito Bonal, Diego', u'Serrano Guio, Jos\xe9 Tom\xe1s', u'Serrano S\xe1nchez-Capuchino, Alfonso Carlos', 'Soler-Espiauba Gallo, Juan', 'Toledo Moreno, Lucila', 'Van-Halen Acedo, Juan']
[u'Andaluz Andaluz, M\xaa Isabel', u'Ardid Jim\xe9nez, M\xaa Isabel', u'Carazo G\xf3mez, M\xf3nica', u'Casares D\xedaz, M\xaa Luc\xeda Inmaculada', u'Cepeda Garc\xeda De Le\xf3n, Jos\xe9 Carmelo', 'Cruz Torrijos, Diego', u'Delgado G\xf3mez, Carla', u'Franco Pardo, Jos\xe9 Manuel', u'Freire Campo, Jos\xe9 Manuel', u'Gabilondo Pujol, \xc1ngel', 'Gallizo Llamas, Mercedes', u"Garc\xeda D'Atri, Ana", u'Garc\xeda-Rojo Garrido, Pedro Pablo', u'G\xf3mez Montoya, Rafael', u'G\xf3mez-Chamorro Torres, Jos\xe9 \xc1ngel', u'Gonz\xe1lez Gonz\xe1lez, M\xf3nica Silvana', u'Leal Fern\xe1ndez, M\xaa Isaura', u'Llop Cuenca, M\xaa Pilar', 'Lobato Gandarias, Juan', u'L\xf3pez Ruiz, M\xaa Carmen', u'Manguan Valderrama, Eva M\xaa', u'Maroto Illera, M\xaa Reyes', u'Mart\xednez Ten, Carmen', u'Mena Romero, M\xaa Carmen', u'Moreno Navarro, Juan Jos\xe9', u'Moya Nieto, Encarnaci\xf3n', 'Navarro Lanchas, Josefa', 'Nolla Estrada, Modesto', 'Pardo Ortiz, Josefa Dolores', u'Quintana Viar, Jos\xe9', u'Rico Garc\xeda-Hierro, Enrique', u'Rodr\xedguez Garc\xeda, Nicol\xe1s', u'S\xe1nchez Acera, Pilar', u'Sant\xedn Fern\xe1ndez, Pedro', 'Segovia Noriega, Juan', 'Vicente Viondi, Daniel', u'Vinagre Alc\xe1zar, Agust\xedn']
['Abasolo Pozas, Olga', 'Ardanuy Pizarro, Miguel', u'Beirak Ulanosky, Jazm\xedn', u'Camargo Fern\xe1ndez, Ra\xfal', 'Candela Pokorna, Marco', 'Delgado Orgaz, Emilio', u'D\xedaz Rom\xe1n, Laura', u'Espinar Merino, Ram\xf3n', u'Espinosa De La Llave, Mar\xeda', u'Fern\xe1ndez Rubi\xf1o, Eduardo', u'Garc\xeda G\xf3mez, M\xf3nica', 'Gimeno Reinoso, Beatriz', u'Guti\xe9rrez Benito, Eduardo', 'Huerta Bravo, Raquel', u'L\xf3pez Hern\xe1ndez, Isidro', u'L\xf3pez Rodrigo, Jos\xe9 Manuel', u'Mart\xednez Abarca, Hugo', u'Morano Gonz\xe1lez, Jacinto', u'Ongil L\xf3pez, Miguel', 'Padilla Estrada, Pablo', u'Ruiz-Huerta Garc\xeda De Viedma, Lorena', 'Salazar-Alonso Revuelta, Cecilia', u'San Jos\xe9 P\xe9rez, Carmen', u'S\xe1nchez P\xe9rez, Alejandro', u'Serra S\xe1nchez, Isabel', u'Serra S\xe1nchez, Clara', 'Sevillano De Las Heras, Elena']
[u'Aguado Crespo, Ignacio Jes\xfas', u'\xc1lvarez Cabo, Daniel', u'Gonz\xe1lez Pastor, Dolores', u'Iglesia Vicente, M\xaa Teresa De La', 'Lara Casanova, Francisco', u'Marb\xe1n De Frutos, Marta', u'Marcos Arias, Tom\xe1s', u'Meg\xedas Morales, Jes\xfas Ricardo', u'N\xfa\xf1ez S\xe1nchez, Roberto', 'Reyero Zubiri, Alberto', u'Rodr\xedguez Dur\xe1n, Ana', u'Rubio Ruiz, Juan Ram\xf3n', u'Ruiz Fern\xe1ndez, Esther', u'Sol\xeds P\xe9rez, Susana', 'Trinidad Martos, Juan', 'Veloso Lozano, Enrique', u'Zafra Hern\xe1ndez, C\xe9sar']
You can add the exact same logic to your spider; I just used requests to show you a working example. You should also be aware that not every asp.net site behaves the same way: you may have to re-validate for every post, as in this related answer.
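To illustrate that last point, here is a hedged sketch of the re-validating variant, reusing post, validate and the imports from the example above. It assumes a site that invalidates its tokens after each request, so the listing page is fetched again before every POST to pick up fresh __VIEWSTATE/__EVENTVALIDATION values:

with requests.Session() as s:
    r = s.get(post)
    hrefs = html.fromstring(r.content).xpath(
        "//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
    for h in hrefs:
        # re-fetch the listing page so this POST carries fresh hidden fields;
        # only needed on sites that invalidate the tokens after every request
        xml = html.fromstring(s.get(post).content)
        form_data = validate(xml)
        form_data["__EVENTTARGET"] = js2xml.parse(h).xpath(
            "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0].text
        r = s.post(post, data=form_data)
        print(html.fromstring(r.content).xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))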