Problem description
I'm trying to scrape links and click counts from the URL listed below. I can scrape the clicks using XPath, but I have trouble scraping the links: the result is "NA". Could anyone please explain why this happens and how to fix it? Here's my script:
library(RSelenium)
library(XML)
remDr <- remoteDriver(remoteServerAddr= "192.168.99.100", port = 4445L)
remDr$open()
remDr$navigate("http://bit.d o")
logbutton <- remDr$findElement("css selector", "#top_login_info a:nth-child(1)")
logbutton$clickElement()
user <- remDr$findElement('css selector', '#login_user_username')
pass <- remDr$findElement('css selector', '#login_user_password')
user$sendKeysToElement(list('test0001'))
pass$sendKeysToElement(list('qwerty1234'))
logb <- remDr$findElement('css selector', '.btn-primary')
logb$clickElement()
remDr$navigate('http://bit.d o/admin/url/http%3A%7C%7C2F%7C%7C2Fedition.cnn.com%7C%7C2F2017%7C%7C2F07%7C%7C2F21%7C%7C2Fopinions%7C%7C2Ftrump-russia-putin-lain-opinion%7C%7C2Findex.html')
html <- htmlParse(remDr$getPageSource()[[1]])
clicks = xpathSApply(html,'//td//span[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]')
links = xpathSApply(html, '//td//br+//a')
Important: I had to put a space between "d" and "o" in the domain name due to a Stack Overflow restriction.
Recommended answer
It seems that your XPath for the links is incorrect. I used SelectorGadget and extracted the following expressions for the links. I wasn't sure which ones you are interested in, so XPaths for both the short (bit.do/...) and the long (cnn.com/...) links are below:
short_links <- xpathSApply(html, '//td//a[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]')
long_links <- xpathSApply(html, '//span[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]')
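One likely reason the original expression fails is that `+` in `'//td//br+//a'` is the CSS adjacent-sibling combinator, which XPath does not support; the XPath counterpart uses the `following-sibling` axis. Also, `xpathSApply` called without a function returns nodes rather than strings. The sketch below illustrates both points on a small, hypothetical HTML fragment (the markup and `example.com` URL are assumptions, since the real page structure is not shown in the question):

```r
library(XML)

# Hypothetical markup standing in for one table row of the real page;
# the actual page structure is not shown in the question.
snippet <- '
<table><tr>
  <td>
    <span>42</span>
    <br/>
    <a href="http://example.com/short">example.com/short</a>
  </td>
</tr></table>'

doc <- htmlParse(snippet, asText = TRUE)

# Text of the first <span> in each cell (the click count in this sketch)
clicks <- xpathSApply(doc, "//td//span[1]", xmlValue)

# XPath counterpart of the CSS selector "br + a":
# the first element following <br> among its siblings, kept only if it is an <a>
link_path <- "//td//br/following-sibling::*[1][self::a]"
link_text <- xpathSApply(doc, link_path, xmlValue)
link_href <- xpathSApply(doc, link_path, xmlGetAttr, "href")
```

Passing `xmlValue` or `xmlGetAttr` as the third argument yields character vectors, which are usually easier to combine into a data frame than raw nodes. The same pattern should apply to the `short_links` and `long_links` expressions above.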
By the way, be careful with the credentials (login and password) you have provided in the question. I would delete them soon after you get your answer.
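One way to keep credentials out of a shared script is to read them from environment variables. This is only a minimal sketch: the helper `get_credential` and the variable names `BITDO_USER` and `BITDO_PASS` are hypothetical and do not appear in the original question.

```r
# Read a required setting from an environment variable and fail early if it is missing.
# The helper name and the variable names used with it are illustrative assumptions.
get_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", var_name, " is not set")
  }
  value
}
```

With the variables set outside the script (for example in `~/.Renviron`), the login step could then use `user$sendKeysToElement(list(get_credential("BITDO_USER")))` and `pass$sendKeysToElement(list(get_credential("BITDO_PASS")))` instead of literal strings.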