I am looking at this great answer: https://stackoverflow.com/a/58211397/3502164. The
beginning of the solution includes:

    library(httr)
    library(xml2)

    gr <- GET("https://nzffdms.niwa.co.nz/search")
    doc <- read_html(content(gr, "text"))
    xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

The output is constant across multiple requests:

    "59243d3a2....61f8f73136118f9"

My default way so far would have been:

    doc <- read_html("https://nzffdms.niwa.co.nz/search")
    xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

That result differs from the output above and changes across multiple requests.

Question:

What is the difference between:

    read_html(url)
    read_html(content(GET(url), "text"))

Why does it result in different values, and why does only the "GET" solution return the CSV in the linked question?

(I hope it's OK to structure it as three sub-questions.)

What I tried:

Going down the rabbit hole of function calls:

    read_html
    (ms <- methods("read_html"))
    getAnywhere(ms[1])
    xml2:::read_html
    xml2:::read_html.default
    #xml2:::read_html.response

    read_xml
    (ms <- methods("read_xml"))
    getAnywhere(ms[1])

But that resulted in this question: Find the used method for R wrapper functions

Thoughts:

I don't see that the GET request takes any headers or cookies that could explain the different responses.

From my understanding, both read_html(url) and read_html(content(GET(.), "text")) will return XML/HTML.

OK, here I am not sure if it makes sense to check, but because I ran out of ideas, I checked whether some kind of caching is going on.

Code:

    with_verbose(GET("https://nzffdms.niwa.co.nz/search"))
    ....
    <- Expires: Thu, 19 Nov 1981 08:52:00 GMT
    <- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0

--> It does not look to me like caching is the explanation.

Looking at help("GET") gives an interesting section concerning a "conditional GET":

The semantics of the GET method change to a "conditional GET" if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field.
A conditional GET method requests that the entity be transferred only under the circumstances described by the conditional header field(s). The conditional GET method is intended to reduce unnecessary network usage by allowing cached entities to be refreshed without requiring multiple requests or transferring data already held by the client.

But as far as I can see with with_verbose(), none of If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range is set.

Solution

The difference is that with repeated calls to httr::GET, the handle persists between calls. With xml2::read_html(url), a new connection is made each time.

From the httr documentation:

The handle pool is used to automatically reuse Curl handles for the same scheme/host/port combination. This ensures that the http session is automatically reused, and cookies are maintained across requests to a site without user intervention.

From the xml2 documentation, discussing the string parameter that is passed to read_html():

A string can be either a path, a url or literal xml. Urls will be converted into connections either using base::url or, if installed, curl::curl

So your answer is: read_html(GET(url)) is like refreshing your browser, but read_html(url) is like closing your browser and opening a new one. The server gives a unique session ID on the page it delivers. New session, new ID.
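The browser analogy above can be sketched as a pair of small helpers. This is only an illustration under assumptions: `extract_token()` and `fetch_token()` are hypothetical names that do not appear in the original answer, and passing a freshly created `handle(url)` to `GET()` is assumed to approximate what `read_html(url)` does when it opens its own connection.

```r
library(httr)
library(xml2)

# Hypothetical helper: pull the CSRF token out of an already parsed page.
extract_token <- function(doc) {
  node <- xml_find_first(doc, ".//input[@name='search[_csrf_token]']")
  xml_attr(node, "value")
}

# Hypothetical helper: fetch the page and return its token.
# fresh = FALSE: httr reuses the pooled handle, so cookies are kept
#                ("refreshing the browser").
# fresh = TRUE:  a new handle is created for this one request, so no
#                cookies are carried over ("opening a new browser").
fetch_token <- function(url, fresh = FALSE) {
  resp <- if (fresh) GET(url, handle = handle(url)) else GET(url)
  stop_for_status(resp)
  extract_token(read_html(resp))
}
```

If the explanation holds, two calls to `fetch_token(url)` in the same R session would be expected to return the same token, while `fetch_token(url, fresh = TRUE)` would be expected to return a new one each time, much like `read_html(url)`.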
You can prove this by calling httr::handle_reset(url):

    library(httr)
    library(xml2)

    # GET the page (note xml2 handles httr responses directly, don't need content("text"))
    gr <- GET("https://nzffdms.niwa.co.nz/search")
    doc <- read_html(gr)
    print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

    # A new GET using the same handle gets exactly the same response
    gr <- GET("https://nzffdms.niwa.co.nz/search")
    doc <- read_html(gr)
    print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

    # Now call GET again after resetting the handle
    httr::handle_reset("https://nzffdms.niwa.co.nz/search")
    gr <- GET("https://nzffdms.niwa.co.nz/search")
    doc <- read_html(gr)
    print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

In my case, sourcing the above code gives me:

    [1] "ecd9be7c75559364a2a5568049c0313f"
    [1] "ecd9be7c75559364a2a5568049c0313f"
    [1] "d953ce7acc985adbf25eceb89841c713"
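A complementary check is to look at the cookies the handle is carrying, rather than at the token alone. The sketch below uses httr's `cookies()` accessor on the response; `handle_cookies_for()` and `same_cookies()` are hypothetical helper names, and which cookie (if any) carries the session for this site is an assumption, not something stated in the answer.

```r
library(httr)

# Hypothetical helper: name/value pairs of the cookies attached to the
# handle that httr used for this request.
handle_cookies_for <- function(url) {
  ck <- cookies(GET(url))
  ck[, c("name", "value")]
}

# Hypothetical helper: TRUE when two cookie tables hold exactly the same
# name/value pairs, regardless of row order.
same_cookies <- function(a, b) {
  identical(sort(paste(a$name, a$value)), sort(paste(b$name, b$value)))
}
```

Under the explanation given here, comparing `handle_cookies_for(url)` before and after a second plain `GET` would be expected to give `TRUE`, and comparing before and after `httr::handle_reset(url)` would be expected to give `FALSE`, provided the server actually sets a session cookie.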