This article covers how to extract comments from HTML code using XPath.

Problem description

I'm trying to extract the text written inside the comment in the following HTML snippet (this is only part of the page's code):

<table id="datalist1" cellspacing="0" border="0" style="border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
<tr>
    <td style="font-size:7pt;">
                                            <table width="100%" border="0" cellspacing="0" cellpadding="0">
                                                <tr align="left">
                                                    <td width="50%" class="subhead1">
                                                        <!-- <b>IE CODE : 0514026049</b> --> ' I want text inside this comment

                                                    </td>
                                                    <td rowspan="9" valign="top">
                                                        <span id="datalist1_ctl00_lbl_p"></span>
                                                    </td>
                                                </tr>

I'm trying the following approach:

1) Get the XPath of the element.

2) Read the web page.

3) Go to the comment node.

4) Extract the text in the comment.

library(rvest)
library(xml2)

url <- 'http://agriexchange.apeda.gov.in/ExportersDirectory/exporters_list.aspx?letter=Z'
webpage <- read_html(url)

# XPath of the comment element I want to grab:
# //*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()

webpage %>%
  html_nodes(xpath = '//*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()') %>%
  html_text()
# character(0)   <- this is the output

But the above code returns an empty character vector. Since I have never used XPath, I don't know whether this is even the right way to go about it.

I'll have to run this for all comment elements. In short, my question is: how do I extract comments from HTML code?

Recommended answer

library(rvest)
library(tidyverse)
library(stringi) # provides stri_replace_all_regex(), which tidyverse does not attach

pg <- read_html("http://agriexchange.apeda.gov.in/ExportersDirectory/exporters_list.aspx?letter=Z")

html_nodes(pg, xpath=".//comment()[contains(., 'IE CODE')]/../../..") %>% # target the comment then back up to the table
  map_df(~{

    # extract the <td> (column 1)
    html_nodes(.x, xpath=".//td[1]") %>%
      html_text(trim=TRUE) %>%
      str_replace_all("[[:space:]]+", " ") -> tmp

    # add in the comment to the "missing" <td> value
    html_node(.x, xpath=".//comment()") %>%
      html_text() %>%
      stri_replace_all_regex("<b>|</b>", "") -> tmp[1]

    # set it up for data frame-ing
    set_names(as.list(tmp), sprintf("X%s", 1:8))

  })
## # A tibble: 196 x 8
##                        X1                      X2                                                                           X3
##                     <chr>                   <chr>                                                                        <chr>
##  1  IE CODE : 0514026049           Z A M PRODUCTS                                          54 DAROOD GRAN SHAHPEER GATE MEERUT
##  2  IE CODE : AQDPV0923E                Z CONNECT             H-302, AIRFORCE NAVAL, ATHIPALAYAM PIRIVU, GANAPATHY, COIMBATORE
##  3  IE CODE : 2912000459        Z K INTERNATIONAL                           MUGHALPURA IST NEAR ISMAIL BEG KI MASJID MORADABAD
##  4  IE CODE : 0307069753  Z K R INTERNATIONAL CO.            4084, PLAZA SHOPPING CENTRE,104/142, SHERIF DEVJI STREET, MUMBAI,
##  5  IE CODE : 3117507531          Z S ENTERPRISES  SURVEY NO 12,PLOT NO.64,FLAT NO 1, KAUSARBAUGH NIBM ROAD KONDHWA KHURD PUNE
##  6  IE CODE : 0500009503               Z. EXPORTS                                 T-283, NEAR GURUDWARA BHAIJI B AHATA KIDARA,
##  7  IE CODE : 0713030658        Z. K. MANGO MANDI                              APMC YARD, RMC CHANNAPATNA, RAMANAGARA DISTRICT
##  8  IE CODE : 0599037351             Z.A. CRAFTS,                      A-56, GALI NO. 6, CHOUHAN BANGER, NEW SEELAM PUR, DELHI
##  9  IE CODE : 0609001353        Z.B.INTERNATIONAL 1ST FLOOR,25TH MILE STONE,AGRA MATHURA ROAD,VILL CHUMURA, POST-FARAH MATHURA
## 10  IE CODE : 0501009256             Z.D. EXPORTS             J-51, EXTENSION, STREET NO. 12/3, RAMESH PARK, LAXMI NAGAR DELHI
## # ... with 186 more rows, and 5 more variables: X4 <chr>, X5 <chr>, X6 <chr>, X7 <chr>, X8 <chr>
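
As for why the XPath in the question likely returned nothing: paths copied from a browser's developer tools usually include `tbody` steps, because the browser adds `<tbody>` to its DOM, while the raw page source (as in the snippet above) has `<tr>` directly under `<table>`, and `read_html()` does not add it. The following is a minimal sketch of that effect and of selecting comment nodes directly. The inline `html` string is a made-up stand-in for the real page (so it runs without network access), and the names `with_tbody`, `comments`, and `ie_codes` are illustrative only.

```r
library(rvest)

# Hypothetical stand-in for the real page: two rows, each holding one
# commented-out IE code. Note that the source contains no <tbody>.
html <- '<table id="datalist1">
<tr><td>
  <table>
    <tr><td class="subhead1"><!-- <b>IE CODE : 0514026049</b> --></td><td>Z A M PRODUCTS</td></tr>
  </table>
</td></tr>
<tr><td>
  <table>
    <tr><td class="subhead1"><!-- <b>IE CODE : 2912000459</b> --></td><td>Z K INTERNATIONAL</td></tr>
  </table>
</td></tr>
</table>'

pg <- read_html(html)

# Browser-style path with tbody steps: nothing in the parsed source matches.
with_tbody <- html_nodes(
  pg,
  xpath = '//*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()'
)
length(with_tbody)

# Select every comment node whose text mentions "IE CODE",
# independent of the exact table nesting.
comments <- html_nodes(pg, xpath = "//comment()[contains(., 'IE CODE')]")

# html_text() returns the raw comment body, including the <b> markup,
# so strip the tags and the surrounding whitespace.
ie_codes <- trimws(gsub("</?b>", "", html_text(comments)))
ie_codes
```

If only the comment text is needed, rather than the full table built above, this shorter `//comment()` selection may be enough; dropping the `tbody` steps from the original path is another option, assuming the rest of the path matches the raw source.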

