Navigating and scraping with R (rvest)

Problem description

I am trying to log in to Stack Overflow and then use the search bar to search for the tidyverse package.

The main problem is that the URL I set does not give me the form to fill in with my email and password:

So url<-"https://stackoverflow.com" doesn't work. I also tried url<-"https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f", which is the URL I get when I click the Log in button, but html_form still does not find the form for my email and password. This is my code:

    library(rvest)

    url<-"https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

    (session <- html_session(url))

    (form <- html_form(read_html(url))[[1]])

    (filled_form <- set_values(form,email="myemail@gmail.com",pass="mypassword"))

    (form_submitted<-submit_form(session,filled_form))

    (submitted_url<-form_submitted$url)

    after_filled_html<-jump_to(session,submitted_url)

After this, I would like to search for the term [tidyverse] and start scraping the results.

I think I will be able to manage this second part once the login/password/form problem in the code above is fixed.

Any help would be appreciated.

Recommended answer

You can set the search term directly in the URL, without needing to log in to Stack Overflow:

    library(rvest)

    # Return the titles and links of the newest questions tagged `search`.
    getStackQuestions <- function(search) {
      # The tag goes straight into the URL, so no login or form is needed.
      stackoverflow <- read_html(paste0('https://stackoverflow.com/questions/tagged/',search,'?tab=Newest'))
      # Select the question title links.
      questions <- stackoverflow %>% html_nodes(".question-hyperlink:not(.mb0)")
      question.href <- questions %>% html_attr('href')
      question.text <- questions %>% html_text()
      # The hrefs are relative, so prepend the site root.
      questions <- data.frame( text = question.text, href = paste0("https://stackoverflow.com",question.href))
      questions
    }

    tidyverse_questions <- getStackQuestions('tidyverse')

    head(tidyverse_questions$text)
    [1] "Python/Pandas equivalent of across and weighted average"
    [2] "Transforming columns based off separate dataframe - R solution"
    [3] "Group by summarize in between dates with dplyr"
    [4] "Transpose complex data.frame with tidyR"
    [5] "Create 1 composite variable derived from different combinations of values of 2nd variable that are separated by specific levels of 3rd variable"
    [6] "extracting a cv.glmnet object from Tune_results"
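
If logging in is still needed, one possible cause of the missing form (an assumption, not confirmed in the question) is that html_form(...)[[1]] takes the first form on the page, which may be the search form rather than the login form. The sketch below runs offline on a made-up page: the HTML, the form ids and the field names email and password are all hypothetical, so inspect the real page's fields before relying on them. It uses the rvest 1.0 function names, where html_form_set() replaces set_values().

```r
library(rvest)

# Hypothetical stand-in for a login page: the search form comes first and
# the login form second. Form ids and field names are assumptions.
page <- minimal_html('
  <form id="search" action="/search" method="get">
    <input type="text" name="q">
  </form>
  <form id="login-form" action="/users/login" method="post">
    <input type="email" name="email">
    <input type="password" name="password">
  </form>
')

# html_form() returns one object per <form> element on the page.
forms <- html_form(page, base_url = "https://example.com")

# Pick the form by its fields instead of by its position on the page.
has_password <- vapply(
  forms,
  function(f) "password" %in% names(f$fields),
  logical(1)
)
login_form <- forms[has_password][[1]]

# Fill in the credentials; the argument names must match the field names.
filled <- html_form_set(
  login_form,
  email = "myemail@gmail.com",
  password = "mypassword"
)

# Confirm the value was stored in the form object.
filled$fields$email$value
```

With a live session, session() and session_submit(session, filled) take the place of html_session() and submit_form(). The real login page may also depend on JavaScript or extra checks, in which case filling the form this way may not be enough.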
