Why can't R scrape these links?


Problem description

I'm trying to scrape links and click counts from the URL listed below. I can scrape the clicks using XPath, but I have a problem scraping the links: that data comes back as "NA". Could anyone explain why this happens and how to fix it? Here's my script:

library(RSelenium)
library(XML)
remDr <- remoteDriver(remoteServerAddr= "192.168.99.100", port = 4445L)
remDr$open()

remDr$navigate("http://bit.d o")
logbutton <- remDr$findElement("css selector", "#top_login_info a:nth-child(1)")
logbutton$clickElement()
user <- remDr$findElement('css selector', '#login_user_username')
pass <- remDr$findElement('css selector', '#login_user_password')
user$sendKeysToElement(list('test0001'))
pass$sendKeysToElement(list('qwerty1234'))
logb <- remDr$findElement('css selector', '.btn-primary')
logb$clickElement()
remDr$navigate('http://bit.d o/admin/url/http%3A%7C%7C2F%7C%7C2Fedition.cnn.com%7C%7C2F2017%7C%7C2F07%7C%7C2F21%7C%7C2Fopinions%7C%7C2Ftrump-russia-putin-lain-opinion%7C%7C2Findex.html')

html <- htmlParse(remDr$getPageSource()[[1]])
clicks = xpathSApply(html,'//td//span[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]')
links = xpathSApply(html, '//td//br+//a')

Important: I had to put a space between "d" and "o" in the domain name due to a Stack Overflow restriction.

Recommended answer

It seems the XPath for your links is incorrect. I used SelectorGadget and extracted the following XPaths for the links. I wasn't sure which ones you are interested in, so XPaths for both the short (bit.do/...) and long (cnn.com/...) links are below:

short_links <- xpathSApply(html, '//td//a[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]')
long_links <- xpathSApply(html, '//span[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]')
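
One possible reason the original expression '//td//br+//a' fails: `+` is the adjacent-sibling combinator in CSS, but in XPath 1.0 it is the addition operator, so the expression is probably evaluated as arithmetic on two node-sets rather than as a node selection. The sketch below shows an XPath equivalent of "an a element directly after a br" and how to pull out the href attribute or the link text instead of the raw nodes. The HTML fragment and the example.com URL are made up for illustration and are not the real page structure.

```r
library(XML)

# Hypothetical HTML fragment standing in for remDr$getPageSource()[[1]]
sample_html <- '
<table>
  <tr>
    <td>
      <span>12</span><br>
      <a href="http://example.com/short">example.com/short</a>
    </td>
  </tr>
</table>'

doc <- htmlParse(sample_html, asText = TRUE)

# XPath counterpart of the CSS selector "td br + a":
# take the first element sibling after each <br>, and keep it only if it is an <a>
link_xpath <- "//td//br/following-sibling::*[1][self::a]"

# Passing a function to xpathSApply returns values instead of node objects
hrefs <- xpathSApply(doc, link_xpath, xmlGetAttr, "href")  # href attribute
labels <- xpathSApply(doc, link_xpath, xmlValue)           # visible link text

print(hrefs)
print(labels)
```

The same pattern (an XPath plus `xmlGetAttr` or `xmlValue`) should also work with the `short_links` and `long_links` expressions above if character vectors are wanted rather than node lists.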

By the way, be careful with the credentials (login and password) you have provided in the question. I would delete them shortly after you get your answer.
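
One way to keep credentials out of a shared script is to read them from environment variables at run time. This is only a minimal sketch: the helper `get_credential` and the variable names `BITDO_USER` and `BITDO_PASS` are hypothetical and do not come from the original post.

```r
# Hypothetical helper: read a credential from an environment variable.
# The variables would be set outside the script, for example in ~/.Renviron.
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

# Usage with the elements from the question's script (not run here,
# since it needs a live Selenium session):
# user$sendKeysToElement(list(get_credential("BITDO_USER")))
# pass$sendKeysToElement(list(get_credential("BITDO_PASS")))
```

With this approach the script can be posted publicly without exposing the login or password.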
