I am using rvest to extract a table from the following page:
https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin
The following code works:
library(rvest)
URL <- 'https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin'
table <- URL %>%
  read_html %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(trim=TRUE)
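However, the margin columns and the president name column contain some strange values. The reason is that the page source contains markup like this: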
<td><span style="display:none">00.001</span>−10.44%</td>
So instead of −10.44% I get 00.001 −10.44%.
How can I fix this?
Best answer
One option is to target the problem columns separately and replace them.
The margin columns can be targeted using xpath:
# get the html
html <- URL %>%
  read_html()
# Example using the first margin column (column 6)
html %>%
  html_nodes(xpath = '//table[2]') %>%        # get table 2
  html_nodes(xpath = './/td[6]/text()') %>%   # column 6 via text(); the leading "." keeps the search inside that table
  iconv("UTF-8", "UTF-8")                     # the author's step intended to convert "−" to "-"
# [1] "−10.44%" "−3.00%" "−0.83%" "−0.51%" "0.09%" "0.17%" "0.57%"
# [8] "0.70%" "1.45%" "2.06%" "2.46%" "3.01%" "3.12%" "3.86%"
#[15] "4.31%" "4.48%" "4.79%" "5.32%" "5.56%" "6.05%" "6.12%"
#[22] "6.95%" "7.27%" "7.50%" "7.72%" "8.51%" "8.53%" "9.74%"
#[29] "9.96%" "10.08%" "10.13%" "10.85%" "11.80%" "12.20%" "12.25%"
#[36] "14.20%" "14.44%" "15.40%" "17.41%" "17.76%" "17.81%" "18.21%"
#[43] "18.83%" "22.58%" "23.15%" "24.26%" "25.22%" "26.17%"
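The values printed above still contain the Unicode minus sign (U+2212) and a trailing percent sign, so they cannot be passed to as.numeric directly. A minimal sketch of a replacement-based cleanup follows. It assumes the goal is a numeric column; the vector margins_raw is a hypothetical stand-in holding a few values copied from the output above, not something the original answer defines.

```r
# Hypothetical sample: a few values copied from the output above.
# "\u2212" is the Unicode minus sign that Wikipedia uses.
margins_raw <- c("\u221210.44%", "\u22123.00%", "0.09%", "26.17%")

# Replace the Unicode minus with an ASCII hyphen-minus (literal match, no regex)
margins_clean <- sub("\u2212", "-", margins_raw, fixed = TRUE)

# Drop the percent sign and convert to numeric
margins_num <- as.numeric(sub("%", "", margins_clean, fixed = TRUE))

print(margins_num)
```

Using fixed = TRUE avoids any regular-expression interpretation of the pattern; sub is enough here because each value contains at most one minus sign and one percent sign.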
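Do the same for the other margin column. I used iconv to convert − to - because this is an encoding issue, but you could use a replacement-based solution instead (e.g. with sub). If the Unicode minus sign still appears in the result, as it does in the output above, the replacement-based route is likely the more reliable one. To target the president name column, xpath can be used again: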
html %>%
  html_nodes(xpath = '//table[2]') %>%          # get table 2
  html_nodes(xpath = './/td[3]/a/text()') %>%   # link text in column 3, searched inside that table only
  html_text()
# [1] "John Quincy Adams" "Rutherford Hayes" "Benjamin Harrison"
# [4] "George W. Bush" "James Garfield" "John Kennedy"
# [7] "Grover Cleveland" "Richard Nixon" "James Polk"
#[10] "Jimmy Carter" "George W. Bush" "Grover Cleveland"
#[13] "Woodrow Wilson" "Barack Obama" "William McKinley"
#[16] "Harry Truman" "Zachary Taylor" "Ulysses Grant"
#[19] "Bill Clinton" "William Henry Harrison" "William McKinley"
#[22] "Franklin Pierce" "Barack Obama" "Franklin Roosevelt"
#[25] "George H. W. Bush" "Bill Clinton" "William Taft"
#[28] "Ronald Reagan" "Franklin Roosevelt" "Abraham Lincoln"
#[31] "Abraham Lincoln" "Dwight Eisenhower" "Ulysses Grant"
#[34] "James Buchanan" "Andrew Jackson" "Martin Van Buren"
#[37] "Woodrow Wilson" "Dwight Eisenhower" "Herbert Hoover"
#[40] "Franklin Roosevelt" "Andrew Jackson" "Ronald Reagan"
#[43] "Theodore Roosevelt" "Lyndon Johnson" "Richard Nixon"
#[46] "Franklin Roosevelt" "Calvin Coolidge" "Warren Harding"
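A different approach, not the one used in the answer above, is to delete the hidden sort-key spans from the parsed document before calling html_table, so that no column needs to be patched afterwards. The sketch below assumes the unwanted values always sit in a span whose style attribute contains display:none, as in the snippet from the question. The sample_html string is a hypothetical stand-in for the Wikipedia table, and it uses an ASCII hyphen instead of the Unicode minus to keep the example portable.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the Wikipedia table, with one hidden sort-key span
sample_html <- '<table>
  <tr><th>Winner</th><th>Margin</th></tr>
  <tr><td>John Quincy Adams</td><td><span style="display:none">00.001</span>-10.44%</td></tr>
  <tr><td>Warren Harding</td><td>26.17%</td></tr>
</table>'

doc <- read_html(sample_html)

# Find every span hidden via display:none and remove it from the document.
# xml_remove() modifies doc in place.
hidden <- html_nodes(doc, xpath = '//span[contains(@style, "display:none")]')
xml_remove(hidden)

# The table can now be parsed as usual
cleaned <- html_table(html_nodes(doc, "table")[[1]], trim = TRUE)
print(cleaned)
```

On the real page the same two steps would be applied to the document returned by read_html(URL) before selecting the second table. Whether every hidden span on that page uses exactly this style attribute is an assumption worth checking against the page source.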
This question (r - scraping an HTML table with spans using rvest) comes from Stack Overflow: https://stackoverflow.com/questions/35730647/