You should cat(as.character(pg)) to see how ugly the HTML is. It's nested tables, but not in a good way. The entries you see there are all <tr> elements with no <table> breaks. Thankfully, there is only a single <td> element in each of those <tr> elements, so we can grab them all in one fell swoop by targeting the correct <table>:
rows <- html_nodes(pg, "table[width='300'] > tr > td")
rows
## {xml_nodeset (60)}
## [1] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
## [2] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">6938 NORTH TELEGRAPH ROAD</font></td>
## [3] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI 48127</font></td>
## [4] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 792-9134</font></td>
## [5] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=6938+NORTH+TELEGRAPH+R ...
## [6] <td width="300" height="6"></td>
## [7] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Advance Auto Parts</b></font></p ...
## [8] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8120 North Telegraph Road</font></td>
## [9] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI 48127</font></td>
## [10] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 528-4920</font></td>
## [11] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8120+North+Telegraph+R ...
## [12] <td width="300" height="6"></td>
## [13] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Pep Boys</b></font></p></td>
## [14] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8955 TELEGRAPH RD</font></td>
## [15] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Redford, MI 48239</font></td>
## [16] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 532-5750</font></td>
## [17] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8955+TELEGRAPH+RD+Redf ...
## [18] <td width="300" height="6"></td>
## [19] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
## [20] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">27207 PLYMOUTH ROAD</font></td>
## ...
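If you want to try that selector without hitting the site, here is a minimal sketch on an inline fragment. The HTML below is made up to imitate the layout (an outer layout table wrapping a 300px-wide table of single-cell rows); it is not the real page, and it assumes rvest is installed.

```r
library(rvest)

# Made-up fragment: an outer layout table wrapping the 300px-wide table
# that holds the entries, one <td> per <tr>
toy <- read_html('<table width="600"><tr><td>
  <table width="300">
    <tr><td bgcolor="#8B0101"><b>Store A</b></td></tr>
    <tr><td>1 Main St</td></tr>
    <tr><td></td></tr>
  </table>
</td></tr></table>')

# Only cells that are direct grandchildren of the width=300 table match;
# the outer layout cell is skipped
cells <- html_nodes(toy, "table[width='300'] > tr > td")

html_text(cells, trim = TRUE)
html_attr(cells, "bgcolor")   # NA wherever the attribute is absent
```

The direct-child (>) steps work here because the parser behind read_html() keeps <tr> directly under <table> instead of inserting a <tbody> the way a browser would.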
There are many approaches one could take to make a data frame out of that mess. One simple one relies on the fact that the store titles have a background color set while the other cells do not. This makes the code a bit fragile, but we can make it less fragile by testing only for the presence of a background color rather than for a specific value. Why do we even need to do this? Well, we need to mark the start and end of each record, and one easy way to do that is to cumsum() a logical vector, knowing that FALSE == 0 and TRUE == 1. Why does that matter? We can create an implicit grouping column that way:
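# Flag the title cells (the only ones with a bgcolor), then turn the flags
# into a running record number. tibble() replaces the deprecated data_frame().
tibble(
  record = !is.na(html_attr(rows, "bgcolor")),
  text = html_text(rows, trim=TRUE)
) %>%
  mutate(record = cumsum(record)) -> xdf
xdf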
## # A tibble: 60 x 2
## record text
## <int> <chr>
## 1 1 "O\u0092REILLY AUTO PARTS"
## 2 1 6938 NORTH TELEGRAPH ROAD
## 3 1 Dearborn Heights, MI 48127
## 4 1 (313) 792-9134
## 5 1 0 miles away
## 6 1
## 7 2 Advance Auto Parts
## 8 2 8120 North Telegraph Road
## 9 2 Dearborn Heights, MI 48127
## 10 2 (313) 528-4920
## # ... with 50 more rows
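To see the cumsum() trick in isolation, here is a tiny base R sketch. The is_title vector is made up for illustration; it stands in for the "has a bgcolor" test on each cell.

```r
# TRUE marks the first cell of each record (toy values, not from the page)
is_title <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Each TRUE adds 1 and each FALSE adds 0, so the running total stays flat
# within a record and steps up at every new title
record <- cumsum(is_title)
record
## [1] 1 1 1 2 2 3 3 3
```

Every cell ends up tagged with the number of the title that precedes it, which is exactly what a grouping column needs.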
Now, we need to remove the empty rows with filter() and do some munging to get the data into a decent form for making a data frame. This is super fragile code: this particular snippet can handle missing phone number data, but that's about it. If there's a second address line, you'll need to modify this approach or use a different one:
filter(xdf, text != "") %>%
group_by(record) %>%
summarise(x = paste0(text, collapse="|")) %>%
separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"), sep="\\|", extra="merge")
## # A tibble: 10 x 5
## record store address1 city_state_zip phone_and_or_distance
## * <int> <chr> <chr> <chr> <chr>
## 1 1 "O\u0092REILLY AUTO PARTS" 6938 NORTH TELEGRAPH ROAD Dearborn Heights, MI 48127 (313) 792-9134|0 miles away
## 2 2 Advance Auto Parts 8120 North Telegraph Road Dearborn Heights, MI 48127 (313) 528-4920|0 miles away
## 3 3 Pep Boys 8955 TELEGRAPH RD Redford, MI 48239 (313) 532-5750|2 miles away
## 4 4 "O\u0092REILLY AUTO PARTS" 27207 PLYMOUTH ROAD Redford, MI 48239 (313) 937-1787|2 miles away
## 5 5 "O\u0092REILLY AUTO PARTS" 14975 TELEGRAPH ROAD Redford, MI 48239 (313) 538-3584|2 miles away
## 6 6 AutoZone 24250 FIVE MILE Redford, MI 48239 (313) 527-6877|2 miles away
## 7 7 "O\u0092REILLY AUTO PARTS" 5940 MIDDLEBELT RD Garden City, MI 48135 (734) 525-1607|3 miles away
## 8 8 AutoZone 6228 MIDDLEBELT RD Garden City, MI 48135 (734) 513-2233|3 miles away
## 9 9 Advance Auto Parts 3845 S Telegraph Rd Dearborn, MI 48124 (313) 274-6549|3 miles away
## 10 10 "O\u0092REILLY AUTO PARTS" 27565 MICHIGAN AVENUE Inkster, MI 48141 (313) 724-8544|3 miles away
Just in case the process was non-obvious, we:
- group the rows by our newly created record column
- smush all the text into one string, with each part separated by |
- separate out all the individual bits
That should hopefully help explain the fragility.
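If positional splitting turns out to be too brittle for your data, one alternative (not what the code above does) is to pick fields out of each record by pattern rather than by position. A rough base R sketch follows. The parse_record() helper, its regexes, and the toy records are all assumptions based on the sample output shown earlier, so adjust them to whatever the real page returns.

```r
# Hypothetical helper: classify the pieces of one record by pattern.
# `parts` is the non-empty text of a single record, store name first.
parse_record <- function(parts) {
  body <- parts[-1]
  first_match <- function(pattern) {
    hit <- grep(pattern, body, value = TRUE)
    if (length(hit) > 0) hit[1] else NA_character_
  }
  phone    <- first_match("^\\([0-9]{3}\\) [0-9]{3}-[0-9]{4}$")
  distance <- first_match("miles? away$")
  csz      <- first_match(", [A-Z]{2} [0-9]{5}$")
  # Anything left over is treated as one or more address lines
  address <- body[!body %in% c(phone, distance, csz)]
  data.frame(
    store = parts[1],
    address = paste(address, collapse = ", "),
    city_state_zip = csz,
    phone = phone,
    distance = distance,
    stringsAsFactors = FALSE
  )
}

# Made-up records: the first has no phone number, the second has an
# extra address line
recs <- list(
  c("Pep Boys", "8955 TELEGRAPH RD", "Redford, MI 48239", "2 miles away"),
  c("AutoZone", "24250 FIVE MILE", "SUITE B", "Redford, MI 48239",
    "(313) 527-6877", "2 miles away")
)
parsed <- do.call(rbind, lapply(recs, parse_record))
parsed
```

With the real data you could feed it split(xdf$text[xdf$text != ""], xdf$record[xdf$text != ""]) in place of recs. It trades one kind of fragility for another, since it now depends on the phone, distance, and city/state/zip formats staying consistent.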
Granted, you only wanted the "how to get to the content" part, but hopefully this saved you some more time.