This article covers how to scrape data with rvest.

Question

I would like to get the article names by category from https://www.inquirer.net/article-index?d=2020-6-13

I've attempted to read the article names by doing:

library('rvest')

year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-',day, sep = "")

pg <- read_html(url)

test <- pg %>%
  html_nodes("#index-wrap") %>%
  html_text()

This returns only one string containing all the article names, and it's very messy.
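
The single string is what html_text() gives when the selector matches one wrapper node: the text of every descendant is concatenated. The sketch below illustrates this on a small inline fragment. The markup and the example.com links are hypothetical stand-ins, not the real page.

```r
library(rvest)

# Hypothetical fragment standing in for the real page (assumed markup).
page <- read_html('
  <div id="index-wrap">
    <ul>
      <li><a rel="bookmark" href="https://example.com/a">First article</a></li>
      <li><a rel="bookmark" href="https://example.com/b">Second article</a></li>
    </ul>
  </div>')

# One node is selected, so html_text() returns one concatenated string.
wrapper_text <- page %>% html_nodes("#index-wrap") %>% html_text()
length(wrapper_text)

# One node per link is selected, so html_text() returns one string per article.
titles <- page %>% html_nodes("#index-wrap a") %>% html_text()
titles
```

Selecting the individual link nodes rather than their container is what yields one element per article.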

Ultimately, I would like a data frame that looks like this:

       Date     Category      Article Name
 2020-06-13         News      ‘We can never let our guard down’ vs terrorism – Cayetano
 2020-06-13         News      PNP spox says mañanita remark did not intend to put Sinas in bad light
 2020-06-13         News      After stranded mom’s death, Pasay LGU helps over 400 stranded individuals
 2020-06-13        World      4 dead after tanker truck explodes on highway in China
 etc.
 etc.
 etc.
 etc.
 2020-06-13    Lifestyle     Book: Melania Trump delayed 2017 move to DC to get new prenup

Does anyone know what I may be missing? I'm very new to this, thanks!

Recommended answer

This is maybe the closest you can get:

library(rvest)
#> Loading required package: xml2
library(tibble)

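year  <- 2020
month <- 06   # stored as the number 6, so the URL ends in 2020-6-13
day   <- 13
url   <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)

# Grab the wrapper, then the article links and their relative post dates.
# Note: an XPath that starts with // searches the whole document, not only `div`.
div       <- read_html(url) %>% html_node(xpath = '//*[@id ="index-wrap"]')
links     <- html_nodes(div, xpath = '//a[@rel = "bookmark"]')
post_date <- html_nodes(div, xpath = '//span[@class = "index-postdate"]') %>%
             html_text()

# One row per article: relative date, headline text, and target URL.
test <- tibble(date = post_date,
               text = html_text(links),
               link = html_attr(links, "href"))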

test
#> # A tibble: 261 x 3
#>    date     text                              link
#>    <chr>    <chr>                             <chr>
#>  1 1 day a~ ‘We can never let our guard down~ https://newsinfo.inquirer.net/129~
#>  2 1 day a~ PNP spox says mañanita remark di~ https://newsinfo.inquirer.net/129~
#>  3 1 day a~ After stranded mom’s death, Pasa~ https://newsinfo.inquirer.net/129~
#>  4 1 day a~ Putting up lining for bike lanes~ https://newsinfo.inquirer.net/129~
#>  5 1 day a~ PH Army provides accommodation f~ https://newsinfo.inquirer.net/129~
#>  6 1 day a~ DA: Local poultry production suf~ https://newsinfo.inquirer.net/129~
#>  7 1 day a~ IATF assessing proposed design t~ https://newsinfo.inquirer.net/129~
#>  8 1 day a~ PCSO lost ‘most likely’ P13B dur~ https://newsinfo.inquirer.net/129~
#>  9 2 days ~ DOH: No IATF recommendations yet~ https://newsinfo.inquirer.net/129~
#> 10 2 days ~ PH coronavirus cases exceed 25,0~ https://newsinfo.inquirer.net/129~
#> # ... with 251 more rows
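
The result above has relative dates and no category column. If the page groups articles under category headings, a per-heading loop could get closer to the requested layout. The following is only a sketch under a loud assumption: that each category is an h4 heading followed by a ul of bookmark links. That structure, the inline fragment, and the example.com links are hypothetical, so the XPath would need adjusting to the real markup.

```r
library(rvest)
library(tibble)

# Hypothetical markup: the real page may structure categories differently.
html <- '
<div id="index-wrap">
  <h4>News</h4>
  <ul>
    <li><a rel="bookmark" href="https://example.com/1">Headline one</a></li>
    <li><a rel="bookmark" href="https://example.com/2">Headline two</a></li>
  </ul>
  <h4>World</h4>
  <ul>
    <li><a rel="bookmark" href="https://example.com/3">Headline three</a></li>
  </ul>
</div>'

# Build a real Date from the parts, zero-padding month and day.
article_date <- as.Date(sprintf("%d-%02d-%02d", 2020, 6, 13))

div      <- read_html(html) %>% html_node("#index-wrap")
headings <- html_nodes(div, xpath = "./h4")

# For each heading, collect the links in the first <ul> that follows it.
# The paths start with ./ so they are evaluated relative to the current node.
rows <- lapply(headings, function(h) {
  links <- html_nodes(h, xpath = "./following-sibling::ul[1]//a[@rel = 'bookmark']")
  tibble(date     = article_date,
         category = html_text(h),
         article  = html_text(links))
})

result <- do.call(rbind, rows)
result
```

Using the requested date for every row sidesteps the "1 day ago" strings, at the cost of assuming every listed article belongs to that day.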
