Problem description
I'm trying to run some scraping where the action I take on a node is conditional on the contents of the node.
This should be a minimal example:
library(rvest)

XML =
'<td class="id-tag">
<span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>'
page = read_html(XML)
Basically, I want to extract html_attr(x, "title") if <span> exists, otherwise just get html_text(x).
The code to do the first is:
page %>% html_nodes(xpath = '//td[@class="id-tag"]/span') %>% html_attr("title")
# [1] "Really Long Text"
The code to do the second is:
page %>% html_nodes(xpath = '//td[@class="id-tag"]') %>% html_text
# [1] "\n Really L...\n" "Short"
The real problem is that the html_attr approach doesn't give me an NA or something similar for the nodes that don't match (even if I first set the xpath to just '//td[@class="id-tag"]' to be sure I've narrowed down to only the relevant nodes). This destroys the order: I can't tell automatically whether the original structure had "Really Long Text" at the first or the second node.
(I thought of doing a join, but the mapping between the abbreviated text and the full text is not one-to-one/invertible).
This seems to be on the right path -- an if/else construction within the xpath -- but doesn't work.
Ideally I'd get the output:
# [1] "Really Long Text" "Short"
Recommended answer
Based on "R Conditional evaluation when using the pipe operator %>%", you can do something like:
page %>%
html_nodes(xpath='//td[@class="id-tag"]') %>%
{ifelse(is.na(html_node(.,xpath="span")),
html_text(.),
{html_node(.,xpath="span") %>% html_attr("title")}
)}
I think it may be simpler to discard the pipe and save some of the objects created along the way:
nodes <- html_nodes(page, xpath='//td[@class="id-tag"]')
text <- html_text(nodes)
title <- html_attr(html_node(nodes,xpath='span'),"title")
value <- ifelse(is.na(html_node(nodes, xpath="span")), text ,title)
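A slightly shorter variation on the non-piped version, as a minimal sketch rather than a definitive implementation: html_node() returns one result per input node, and html_attr() gives NA where no <span> was found, so the NA in title can drive the fallback directly. The sample HTML here is an assumption (the question's two cells wrapped in a table row so the parser keeps the <td> elements).

```r
library(rvest)

# Assumed sample input: the question's cells, wrapped in a table row
# so the HTML parser keeps the <td> elements.
html <- '<table><tr>
<td class="id-tag"><span title="Really Long Text">Really L...</span></td>
<td class="id-tag">Short</td>
</tr></table>'
page <- read_html(html)

nodes <- html_nodes(page, xpath = '//td[@class="id-tag"]')
text  <- html_text(nodes)

# html_node() keeps one slot per <td>, so positions line up with `nodes`;
# html_attr() yields NA for cells that have no <span> child.
title <- html_attr(html_node(nodes, xpath = "span"), "title")

# Fall back to the cell text wherever no title was found.
value <- ifelse(is.na(title), text, title)
value
```

Note that a <span> without a title attribute also yields NA here, so such a cell falls back to its text as well.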
An XPath-only approach might be:
page %>%
html_nodes(xpath='//td[@class="id-tag"]/span/@title|//td[@class="id-tag"][not(.//span)]') %>%
html_text()