问题描述
我正在尝试从网站上获取一些数据,但是由于我是Web抓取的新手,因此在标签名称,类代码和ID中感到困惑.我想在数据"下面复制,如果数据不存在,则该单元格应留空,并且代码需要移至下一个值.
I am trying to pull some data from web-site, but as I'm new learner in the web scraping hence confused in the Tag Name, Class Code and in ID.. I have only basic knowledge about it.I want to copy below Data and if the data is not exist then the cell should be left blank and code needs to be move in next value.
Class="container size" - 5*5,5*10 kind of value
Class="description" - Standard in this case also need to copy like Drive-up Access
Class="offer1" & "offer2" - Call for Availability
Class="price"
我试图构建代码,但无法准确判断需要选择哪个标签名称,以下是我的代码,请帮助我复制此数据.
I tried to frame a code but can't judge the exact which Tag name needs to be choose, below is my code please help me to copy this data.
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
.Navigate2 "" & Sheets("Home").Range("C3").Text
While .Busy Or .readyState < 4: DoEvents: Wend
Sheets("Unit Data").Select
Dim listings As Object, listing As Object, headers(), results()
Dim r As Long, list As Object, item As Object
headers = Array("size", "features")
Set list = .document.getElementsByClassName("units-table main")
'.unit_size medium, .features, .promo_offers, .board_rate_wrapper p, .board_rate
Dim rowCount As Long
rowCount = .document.querySelectorAll(".units-table main li").Length
ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
For Each listing In list
For Each item In listing.getElementsByTagName("li")
r = r + 1
On Error Resume Next
results(r, 1) = item.getElementsByClassName("container size")(0).innerText
results(r, 2) = item.getElementsByClassName("description")(0).innerText
On Error GoTo 0
Next
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
推荐答案
XHR:
所有信息均可通过 XMLHTTP(XHR)请求获得-比打开浏览器快得多.
All the info is available via XMLHTTP (XHR) request - much faster than opening a browser.
我首先使用.main li[class]
的css选择器检索行数."."
是类选择器,即是类型选择器,而[class]
是属性选择器.中间的空间" "
是后代组合器.那指定我要检索所有具有类属性的li
标记/类型元素,其父类的名称为main
.
I first retrieve the row count with a css selector of .main li[class]
.The "."
is a class selector, the li
is a type selector and the [class]
is an attribute selector. The space, " "
, in between is a descendant combinator. That specifies I want to retrieve all li
tag/type elements, with a class attribute, having a parent whose classname is main
.
此匹配项如下:
如您所见,这给了我行数;用于从结果集中检索信息的父li
元素的数量.
As you can see, this gives me the row count; the number of parent li
elements to retrieve info from for the result set.
返回此 li元素的集合作为querySelectorAll
的nodeList.我无法遍历此列表,将getElementsByClassName
/querySelector
应用于各个节点,因为li
元素没有提供我可以使用的方法.
This collection of li elements is returned as a nodeList by querySelectorAll
. I cannot loop over this list applying getElementsByClassName
/ querySelector
to individual nodes, as li
elements expose no methods I can use.
现在,由于我没有使用浏览器,因此我不得不依靠 HTMLDocument 对象.与浏览器不同,我无权访问受限的伪类选择器,这将允许我使用选择器语法,例如 :nth-of-type
来访问各个行.这是使用VBA进行网页抓取的烦人限制.
Now, as I am not using a browser, I am forced to rely on the methods available to HTMLDocument object. Unlike with a browser, I do not have access to the limited pseudo class selectors that they support, when automated via VBA, which would allow me to use selector syntax such as :nth-of-type
to access individual rows. This is an annoying limitation of web-scraping with VBA.
那么,我们该怎么办?好吧,在这种情况下,我可以将每个节点的innerHTML
转储到另一个HTMLDocument
变量html2
中,以便可以访问该对象的querySelector/querySelectorAll
方法.这样,HTML将仅限于当前的li
.
So, what can we do? Well, in this instance I can dump the innerHTML
of each node into another HTMLDocument
variable, html2
, so that I can access the querySelector/querySelectorAll
methods of that object. The HTML will then only be limited to the current li
.
如果我们查看有问题的HTML:
If we look at the HTML in question:
我们可以看到li
元素是一般的同级元素.他们在同一级别上彼此相邻坐着.当我循环nodeList listings
时,我正在将innerHTML
从当前节点传输到html2
;我的第二个HTMLDocument
变量.
We can see that the li
elements are general siblings. They sit next to each other at the same level. As I loop my nodeList listings
, I am transferring the innerHTML
from the current node into html2
; my second HTMLDocument
variable.
值得注意的是,我可能可以使用children
来列出每个列表:
It's worth noting I could probably have descended each listing using children
for example:
listings.item(i).Children(2)......
然后我可以在newLines上进行拆分,以便访问所有信息.我认为我的给定方法虽然更快,也更可靠.
I could then have split on newLines etc so access all info. I think my given method is faster and more robust though.
VBA:
Option Explicit
Public Sub GetInfo()
Dim ws As Worksheet, html As HTMLDocument, s As String
Const URL As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f1426"
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
Dim headers(), results(), listings As Object, amenities As String
headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")
Set listings = html.querySelectorAll(".main li[class]")
Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long
rowCount = listings.Length
numColumns = UBound(headers) + 1
ReDim results(1 To rowCount, 1 To numColumns)
Dim html2 As HTMLDocument
Set html2 = New HTMLDocument
For item = 0 To listings.Length - 1
r = r + 1
html2.body.innerHTML = listings.item(item).innerHTML
'size,description, amenities,specials offer1 offer2, rate type, price
results(r, 1) = Trim$(html2.querySelector(".size").innerText)
results(r, 2) = Trim$(html2.querySelector(".description").innerText)
Set icons = html2.querySelectorAll("i[title]")
ReDim amenitiesInfo(0 To icons.Length - 1)
For icon = 0 To icons.Length - 1
amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
Next
amenities = Join$(amenitiesInfo, ", ")
results(r, 3) = amenities
results(r, 4) = html2.querySelector(".offer1").innerText
results(r, 5) = html2.querySelector(".offer2").innerText
results(r, 6) = html2.querySelector(".rate-label").innerText
results(r, 7) = html2.querySelector(".price").innerText
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Internet Explorer:
假设未从给定的URL重定向.在这里,我使用:nth-of-type伪类选择器来定位列表的每一行.这些行是li
(列表)元素,其中包含每个框列表的信息.我建立了一个css选择器字符串,该字符串指定行,然后指定我要跟随的行中的元素.我将该字符串传递给querySelector
或querySelectorAll
,它返回匹配的元素.
Assuming not re-directed from given url. Here I am using :nth-of-type pseudo class selector to target each row of the listings. Those rows are the li
(list) elements holding the info for each box listing. I build up a css selector string that specifies the row and then the element within the row I am after. I pass that string to querySelector
, or querySelectorAll
which returns matched element/s.
Option Explicit
Public Sub UseIE()
Dim ie As New InternetExplorerm, ws As Worksheet
Const Url As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f142"
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
.Navigate2 Url
While .Busy Or .readyState < 4: DoEvents: Wend
Dim headers(), results(), listings As Object, listing As Object, amenities As String
headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")
Set listings = .document.querySelectorAll(".main li[class]")
Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long
rowCount = listings.Length
numColumns = UBound(headers) + 1
ReDim results(1 To rowCount, 1 To numColumns)
For Each listing In listings
r = r + 1
'size,description, amenities,specials offer1 offer2, rate type, price
With .document
results(r, 1) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .size").innerText)
results(r, 2) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .description").innerText)
Set icons = .querySelectorAll("." & Join$(Split(listing.className, Chr$(32)), ".") & ":nth-of-type(" & r & ") i[title]")
ReDim amenitiesInfo(0 To icons.Length - 1)
For icon = 0 To icons.Length - 1
amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
Next
amenities = Join$(amenitiesInfo, ",")
results(r, 3) = amenities
results(r, 4) = .querySelector(".main li:nth-of-type(" & r & ") .offer1").innerText
results(r, 5) = .querySelector(".main li:nth-of-type(" & r & ") .offer2").innerText
results(r, 6) = .querySelector(".main li:nth-of-type(" & r & ") .rate-label").innerText
results(r, 7) = .querySelector(".main li:nth-of-type(" & r & ") .price").innerText
End With
Next
.Quit
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
参考(VBE>工具>参考):
References (VBE > Tools > References):
- Microsoft HTML对象库
- Microsoft Internet控件
这篇关于通过TagName进行网页爬取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!