通过TagName进行网页爬取

本文介绍了通过TagName进行网页爬取的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试从网站上获取一些数据，但是由于我是Web抓取的新手，因此在标签名称，类代码和ID中感到困惑.我想在数据"下面复制，如果数据不存在，则该单元格应留空，并且代码需要移至下一个值.

I am trying to pull some data from web-site, but as I'm new learner in the web scraping hence confused in the Tag Name, Class Code and in ID.. I have only basic knowledge about it.I want to copy below Data and if the data is not exist then the cell should be left blank and code needs to be move in next value.

Class="container size" - 5*5,5*10 kind of value
Class="description" - Standard in this case also need to copy like Drive-up Access
Class="offer1" & "offer2" - Call for Availability
Class="price"

我试图构建代码，但无法准确判断需要选择哪个标签名称，以下是我的代码，请帮助我复制此数据.

I tried to frame a code but can't judge the exact which Tag name needs to be choose, below is my code please help me to copy this data.

Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
    .Visible = True
    .Navigate2 "" & Sheets("Home").Range("C3").Text

    While .Busy Or .readyState < 4: DoEvents: Wend

    Sheets("Unit Data").Select


    Dim listings As Object, listing As Object, headers(), results()
    Dim r As Long, list As Object, item As Object
    headers = Array("size", "features")
    Set list = .document.getElementsByClassName("units-table main")
    '.unit_size medium, .features, .promo_offers, .board_rate_wrapper p, .board_rate
    Dim rowCount As Long
    rowCount = .document.querySelectorAll(".units-table main li").Length


    ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
    For Each listing In list
        For Each item In listing.getElementsByTagName("li")
            r = r + 1
            On Error Resume Next
            results(r, 1) = item.getElementsByClassName("container size")(0).innerText
            results(r, 2) = item.getElementsByClassName("description")(0).innerText

            On Error GoTo 0


        Next
    Next
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    .Quit
End With

推荐答案

XHR:

所有信息均可通过 XMLHTTP(XHR)请求获得-比打开浏览器快得多.

All the info is available via XMLHTTP (XHR) request - much faster than opening a browser.

我首先使用.main li[class]的css选择器检索行数."."是类选择器，即是类型选择器，而[class]是属性选择器.中间的空间" "是后代组合器.那指定我要检索所有具有类属性的li标记/类型元素，其父类的名称为main.

I first retrieve the row count with a css selector of .main li[class].The "." is a class selector, the li is a type selector and the [class] is an attribute selector. The space, " ", in between is a descendant combinator. That specifies I want to retrieve all li tag/type elements, with a class attribute, having a parent whose classname is main.

此匹配项如下:

如您所见，这给了我行数；用于从结果集中检索信息的父li元素的数量.

As you can see, this gives me the row count; the number of parent li elements to retrieve info from for the result set.

返回此 li元素的集合作为querySelectorAll的nodeList.我无法遍历此列表，将getElementsByClassName/querySelector应用于各个节点，因为li元素没有提供我可以使用的方法.

This collection of li elements is returned as a nodeList by querySelectorAll. I cannot loop over this list applying getElementsByClassName / querySelector to individual nodes, as li elements expose no methods I can use.

现在，由于我没有使用浏览器，因此我不得不依靠 HTMLDocument 对象.与浏览器不同，我无权访问受限的伪类选择器，这将允许我使用选择器语法，例如 :nth-of-type 来访问各个行.这是使用VBA进行网页抓取的烦人限制.

Now, as I am not using a browser, I am forced to rely on the methods available to HTMLDocument object. Unlike with a browser, I do not have access to the limited pseudo class selectors that they support, when automated via VBA, which would allow me to use selector syntax such as :nth-of-type to access individual rows. This is an annoying limitation of web-scraping with VBA.

那么，我们该怎么办?好吧，在这种情况下，我可以将每个节点的innerHTML转储到另一个HTMLDocument变量html2中，以便可以访问该对象的querySelector/querySelectorAll方法.这样，HTML将仅限于当前的li.

So, what can we do? Well, in this instance I can dump the innerHTML of each node into another HTMLDocument variable, html2, so that I can access the querySelector/querySelectorAll methods of that object. The HTML will then only be limited to the current li.

如果我们查看有问题的HTML:

If we look at the HTML in question:

我们可以看到li元素是一般的同级元素.他们在同一级别上彼此相邻坐着.当我循环nodeList listings时，我正在将innerHTML从当前节点传输到html2；我的第二个HTMLDocument变量.

We can see that the li elements are general siblings. They sit next to each other at the same level. As I loop my nodeList listings, I am transferring the innerHTML from the current node into html2; my second HTMLDocument variable.

值得注意的是，我可能可以使用children来列出每个列表:

It's worth noting I could probably have descended each listing using children for example:

listings.item(i).Children(2)......

然后我可以在newLines上进行拆分，以便访问所有信息.我认为我的给定方法虽然更快，也更可靠.

I could then have split on newLines etc so access all info. I think my given method is faster and more robust though.

VBA:

Option Explicit
Public Sub GetInfo()
    Dim ws As Worksheet, html As HTMLDocument, s As String
    Const URL As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f1426"

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
        html.body.innerHTML = s

        Dim headers(), results(), listings As Object, amenities As String

        headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")
        Set listings = html.querySelectorAll(".main li[class]")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long

        rowCount = listings.Length
        numColumns = UBound(headers) + 1

        ReDim results(1 To rowCount, 1 To numColumns)
        Dim html2 As HTMLDocument
        Set html2 = New HTMLDocument
        For item = 0 To listings.Length - 1
            r = r + 1
            html2.body.innerHTML = listings.item(item).innerHTML
            'size,description, amenities,specials offer1 offer2, rate type, price

            results(r, 1) = Trim$(html2.querySelector(".size").innerText)
            results(r, 2) = Trim$(html2.querySelector(".description").innerText)
            Set icons = html2.querySelectorAll("i[title]")

            ReDim amenitiesInfo(0 To icons.Length - 1)

            For icon = 0 To icons.Length - 1
                amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
            Next

            amenities = Join$(amenitiesInfo, ", ")

            results(r, 3) = amenities
            results(r, 4) = html2.querySelector(".offer1").innerText
            results(r, 5) = html2.querySelector(".offer2").innerText
            results(r, 6) = html2.querySelector(".rate-label").innerText
            results(r, 7) = html2.querySelector(".price").innerText
        Next

        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Internet Explorer:

假设未从给定的URL重定向.在这里，我使用:nth-of-type伪类选择器来定位列表的每一行.这些行是li(列表)元素，其中包含每个框列表的信息.我建立了一个css选择器字符串，该字符串指定行，然后指定我要跟随的行中的元素.我将该字符串传递给querySelector或querySelectorAll，它返回匹配的元素.

Assuming not re-directed from given url. Here I am using :nth-of-type pseudo class selector to target each row of the listings. Those rows are the li (list) elements holding the info for each box listing. I build up a css selector string that specifies the row and then the element within the row I am after. I pass that string to querySelector, or querySelectorAll which returns matched element/s.

Option Explicit

Public Sub UseIE()
    Dim ie As New InternetExplorerm, ws As Worksheet
    Const Url As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f142"

    Set ws = ThisWorkbook.Worksheets("Sheet1")

    With ie
        .Visible = True
        .Navigate2 Url

        While .Busy Or .readyState < 4: DoEvents: Wend

        Dim headers(), results(), listings As Object, listing As Object, amenities As String

        headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")

        Set listings = .document.querySelectorAll(".main li[class]")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long

        rowCount = listings.Length
        numColumns = UBound(headers) + 1
        ReDim results(1 To rowCount, 1 To numColumns)
        For Each listing In listings
            r = r + 1
            'size,description, amenities,specials offer1 offer2, rate type, price
            With .document

                results(r, 1) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .size").innerText)
                results(r, 2) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .description").innerText)

                Set icons = .querySelectorAll("." & Join$(Split(listing.className, Chr$(32)), ".") & ":nth-of-type(" & r & ") i[title]")

                ReDim amenitiesInfo(0 To icons.Length - 1)

                For icon = 0 To icons.Length - 1
                    amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
                Next

                amenities = Join$(amenitiesInfo, ",")
                results(r, 3) = amenities
                results(r, 4) = .querySelector(".main li:nth-of-type(" & r & ") .offer1").innerText
                results(r, 5) = .querySelector(".main li:nth-of-type(" & r & ") .offer2").innerText
                results(r, 6) = .querySelector(".main li:nth-of-type(" & r & ") .rate-label").innerText
                results(r, 7) = .querySelector(".main li:nth-of-type(" & r & ") .price").innerText
            End With
        Next
        .Quit
        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

参考(VBE>工具>参考):

References (VBE > Tools > References):

Microsoft HTML对象库
Microsoft Internet控件

这篇关于通过TagName进行网页爬取的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！