This article covers web scraping the Google Play Store in R. The discussion below may be a useful reference for anyone facing the same problem.

Problem description

I want to scrape review data for several apps from the Google Play store. Specifically, for each review I want:

  1. the name field

  2. how many stars they gave

  3. the review they wrote

Here is a snapshot of the scenario:

#Loading the rvest package
library('rvest')

#Specifying the url of the website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'

#Reading the HTML code from the website
webpage <- read_html(url)

#Using a CSS selector to scrape the name section
Name_data_html <- html_nodes(webpage,'.kx8XBd .X43Kjb')

#Converting the Name data to text
Name_data <- html_text(Name_data_html)

#Look at the Name
head(Name_data)

But the result is

> head(Name_data)

character(0)

Later, when I tried to investigate further, I found that Name_data_html contains

> Name_data_html
{xml_nodeset (0)}

I am new to web scraping; can anyone help me out with this?

Answer

After analyzing your code and the source page of the URL you posted, I think the reason you are unable to scrape anything is that the content is generated dynamically, so rvest cannot get at it.

Here is my solution:

#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe operator
library(RSelenium) # to get the HTML after the dynamic content has loaded

#Specifying the url of the website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'

# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()

# go to the website
remDr$navigate(url)

# get the page source and parse it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()

# 1) name field (assuming that by 'name' you mean the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()

# 2) how many stars they gave
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")

# 3) the review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()

# create a data frame with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = FALSE)
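Note that the stars come back as accessibility strings (the `aria-label` attribute) rather than numbers, and review text scraped from HTML often carries stray whitespace. A minimal post-processing sketch is below; it assumes the aria-label follows the Play Store's "Rated N stars out of five stars" wording (the exact phrasing may vary), and the helper names are my own:

```r
# Hypothetical helper: pull the first digit out of an aria-label string
# such as "Rated 4 stars out of five stars" (exact wording is an assumption).
parse_stars <- function(labels) {
  as.numeric(sub("^\\D*(\\d+).*$", "\\1", labels))
}

# Reviews scraped from HTML often keep leading/trailing whitespace.
clean_reviews <- function(reviews) trimws(reviews)

parse_stars(c("Rated 4 stars out of five stars",
              "Rated 1 stars out of five stars"))  # 4 1
clean_reviews("  Great app!  ")                    # "Great app!"
```

When you are done scraping, `remDr$close()` shuts the browser session down.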

In my solution, I'm using RSelenium, which is able to load the webpage as if you were navigating to it (instead of just downloading it the way rvest does). This way, all of the dynamically generated content is loaded, and once it has loaded you can retrieve it with rvest and scrape it.

If you have any doubts about my solution, just let me know!

Hope it helps!

