Question
The tables I would like to scrape contain URLs. When I run the code below, I get only the link text (the description of each URL), not the URL itself. How can I get a table in which a column (in my case the second column) holds the URLs instead of their descriptions, or the full HTML of each anchor? I need this to extract two index codes from the URLs in the second column of the table. The links I would like to scrape look like https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20795&OW=15&OrthopedicSupply=False&Code=150000001 and I need the ProviderId and Code values, but first I need the links in the table scraped by the code below.
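library(rvest)
tables <- list()  # one list element per results page
for (i in 1:10) {
  url <- paste0("https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=", i)
  page <- read_html(url)
  # html_table() keeps only the cell text, so the link URLs are lost here
  tables[[i]] <- html_table(page)
}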
Thanks for any comments and help.
Recommended answer
This should help you get a nice, clean, complete table with the hrefs you want:
library(rvest)
library(tidyverse)

# Helpers
# Drop everything from the first carriage return onward
rm_extra <- function(x) { gsub("\r.*$", "", x) }

# Lower-case the names and replace runs of spaces with underscores
mk_gd_col_names <- function(x) {
  tolower(x) %>%
    gsub(" +", "_", .)
}

URL <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=%d"

get_table <- function(page_num = 1) {
  pg <- read_html(sprintf(URL, page_num))
  tab <- html_nodes(pg, "table")
  # Drop columns 1 and 11, then clean the column names and cell text
  html_table(tab)[[1]][, -c(1, 11)] %>%
    set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
    mutate_all(rm_extra) %>%
    # html_table() returns only the link text, so read the href attribute
    # from the anchor in the second cell of each row
    mutate(link = html_nodes(tab, xpath = ".//td[2]/a") %>% html_attr("href")) %>%
    as_tibble()
}

pb <- progress_estimated(10)
full_df <- map_df(1:10, function(i) {
  pb$tick()$print()
  get_table(page_num = i)
})
glimpse(full_df)
## Observations: 93
## Variables: 10
## $ kod <chr> "150000016", "150005039", "1500046...
## $ nazwa_świadczeniodawcy <chr> "SAMODZIELNY PUBLICZNY ZAKŁAD OPIE...
## $ miasto <chr> "GRODZISK WIELKOPOLSKI", "KALISZ",...
## $ ulica <chr> "MOSSEGO 17", "POZNAŃSKA 23", "OS....
## $ kod_pocztowy <chr> "62065", "62800", "60688", "62510"...
## $ nip <chr> "9950036856", "6181976770", "97201...
## $ regon <chr> "317760", "251525840", "630804009"...
## $ sumaryczna_kwota_zobowiązań <chr> "8 432 922,00", "332 078,25", "416...
## $ szczegóły <chr> "Umowy", "Umowy", "Umowy", "Umowy"...
## $ link <chr> "/umowy/Agreements/GetAgreements?R...
full_df
## # A tibble: 93 × 10
## kod
## <chr>
## 1 150000016
## 2 150005039
## 3 150004658
## 4 150009135
## 5 150003546
## 6 150000066
## 7 150003556
## 8 150000073
## 9 150003539
## 10 150008909
## # ... with 83 more rows, and 9 more variables:
## # nazwa_świadczeniodawcy <chr>, miasto <chr>, ulica <chr>,
## # kod_pocztowy <chr>, nip <chr>, regon <chr>,
## # sumaryczna_kwota_zobowiązań <chr>, szczegóły <chr>, link <chr>
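The question also asks for the ProviderId and Code values inside each link. A minimal sketch of one way to pull them out, using base R only: the helper name get_query_param is hypothetical (it is not part of rvest or of the answer above), and the sample link is the example URL quoted in the question. It assumes parameter names contain no regex metacharacters.

```r
# Hypothetical helper (not from the original answer): extract one query
# parameter from each URL in a character vector, using base R only.
get_query_param <- function(urls, name) {
  # Match "?name=value" or "&name=value"; the value runs up to the next
  # "&" or "#". Assumes `name` contains no regex metacharacters.
  pattern <- paste0("[?&]", name, "=([^&#]*)")
  m <- regmatches(urls, regexec(pattern, urls))
  # Each element holds the full match followed by the captured value;
  # it is empty when the parameter is absent, which is mapped to NA.
  vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_, character(1))
}

# Sample link taken from the question
links <- "https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20795&OW=15&OrthopedicSupply=False&Code=150000001"

ids <- data.frame(
  provider_id = get_query_param(links, "ProviderId"),
  code        = get_query_param(links, "Code"),
  stringsAsFactors = FALSE
)
print(ids)
```

For the sample link this gives provider_id "20795" and code "150000001". Applied to the scraped table, the same helper could be called on the link column, for example get_query_param(full_df$link, "ProviderId"). The relative links shown in the output work as well, since only the query string is inspected.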