Web scraping with R - no HTML visible


Question

I am trying to use R to scrape a website:

http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/GO/90000609234

It has several fields with lots of information. I am only interested in the URL above the field "site do candidato". In this example, the URL I want is: "http://vanderlansenador111.com.br"

The problem is that there is no visible HTML, so I don't think rvest will help (at least, I don't know how to use it here). Is there a way to scrape the page without using Selenium? I have never used RSelenium and had some problems trying to run it.

Pointers in any direction are much appreciated.

Recommended answer

Don't waste your time with Selenium. Open your browser's Developer Tools and look in the Network tab for the XHR request the page makes: http://divulgacandcontas.tse.jus.br/divulga/rest/v1/candidatura/buscar/2018/GO/2022802018/candidato/90000609234

Then use jsonlite::fromJSON():

str(jsonlite::fromJSON("http://divulgacandcontas.tse.jus.br/divulga/rest/v1/candidatura/buscar/2018/GO/2022802018/candidato/90000609234"))

The str() output is large and complete. You should be able to find what you need there.
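Since the answer does not say which field holds the candidate's site, one way to narrow down the large str() output is to search the parsed result for strings that look like URLs. The following is only a minimal sketch: find_urls() is a hypothetical helper, not part of jsonlite, and the sample JSON uses made-up field names rather than the real response layout.

```r
library(jsonlite)

# Hypothetical helper (not part of jsonlite): flatten a parsed JSON
# structure and keep every string value that starts with http:// or https://.
find_urls <- function(x) {
  values <- unlist(x, use.names = TRUE)   # flatten nested lists into one named vector
  values <- values[!is.na(values)]        # drop missing values
  values[grepl("^https?://", values)]     # keep only URL-like strings
}

# Offline demonstration on an invented JSON snippet.
# The field names here are assumptions for illustration only.
sample_json <- '{
  "id": 1,
  "name": "Example",
  "links": ["http://example.com"],
  "nested": {"photo": "https://example.org/a.jpg"}
}'
parsed <- fromJSON(sample_json)
urls <- find_urls(parsed)
print(urls)

# Against the endpoint from the answer (needs network access, and the
# response layout may differ from what this sketch expects):
# candidate <- fromJSON("http://divulgacandcontas.tse.jus.br/divulga/rest/v1/candidatura/buscar/2018/GO/2022802018/candidato/90000609234")
# find_urls(candidate)
```

The names on the returned vector show the path to each value, which may help you identify the field to read directly once you have spotted the candidate's site.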
