This article covers what to do when a webpage cannot be read with read_html from R's rvest package, and walks through a recommended answer.
Problem description
I'm trying to scrape the location of product reviewers from Amazon, for example from this webpage.
I need to get "Haynesville, Illinois, United States".
I use the rvest package for web scraping.
Here is what I did:
library(rvest)
url='https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
page = read_html(url)
I get the following error:
Error in open.connection(x, "rb") : HTTP error 403.
However, the following works:
con <- url(url, "rb")
page = read_html(con)
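The 403 here usually means Amazon is refusing the request based on how it was made, not that the page is missing; R's default request headers are easy to reject. As an alternative to opening a raw connection, a browser-like User-Agent can be sent with the httr package. This is a sketch under assumptions: httr is installed, the User-Agent string is just an example, and Amazon may still block automated requests.

```r
library(rvest)
library(httr)

url <- 'https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'

# Identify as a regular browser; many sites answer 403 to R's default agent
resp <- GET(url, user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
stop_for_status(resp)  # fail early if the server still refuses

# Parse the response body as HTML
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
```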
However, from the page I read, I could not extract any text. For example, I want to extract the reviewer's location:
page %>%
  html_nodes("#customer-profile-name-header .a-size-base a-color-base") %>%
  html_text()
I get nothing:
character(0)
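Two separate things go wrong with the selector above. First, `.a-size-base a-color-base` (with a space) is a descendant selector: it looks for an element *named* `a-color-base` inside something with class `a-size-base`; matching one element that carries both classes requires chaining the dots, `.a-size-base.a-color-base`. Second, on this profile page the visible HTML is assembled by JavaScript, so even a correct selector finds nothing in the static source. The selector point can be seen on a tiny hand-made snippet (hypothetical markup, for illustration only):

```r
library(rvest)

# Hypothetical one-element document mimicking the classes on the profile page
doc <- minimal_html('<span class="a-size-base a-color-base">Haynesville, Illinois</span>')

# Space = descendant combinator: looks for an <a-color-base> element -> no match
doc %>% html_nodes(".a-size-base a-color-base") %>% html_text()
# character(0)

# Chained classes: one element carrying both classes -> match
doc %>% html_nodes(".a-size-base.a-color-base") %>% html_text()
# "Haynesville, Illinois"
```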
Can anyone help figure out what I did wrong? Thanks a lot in advance.
Recommended answer
This should work:
library(dplyr)
library(rvest)
library(stringr)

# get url
url <- 'https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'

# open page
con <- url(url, "rb")
page <- read_html(con)

# get the desired information, using View Page Source:
# the profile data sits inside a <script> tag as JSON, not in the rendered HTML
page %>%
  html_nodes(xpath = ".//script[contains(., 'occupation')]") %>%
  html_text() %>%
  str_match("location\":\"(.*?)\",\"personalDescription") -> res
res[, 2]
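The regex works, but since the matched text is part of a JSON object, parsing the JSON is more robust against key reordering or escaped quotes inside the location. A sketch on a hand-made sample of the embedded data (the sample string is illustrative; the real script content is much larger), assuming the jsonlite package is available:

```r
library(stringr)
library(jsonlite)  # assumption: jsonlite is installed

# Illustrative fragment shaped like the JSON the answer's regex targets
sample_json <- '{"location":"Haynesville, Illinois, United States","personalDescription":"","occupation":""}'

# The regex approach from the answer, applied to the sample
str_match(sample_json, "location\":\"(.*?)\",\"personalDescription")[, 2]
# "Haynesville, Illinois, United States"

# Parsing the JSON instead does not depend on the order of the keys
fromJSON(sample_json)$location
# "Haynesville, Illinois, United States"
```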