Problem description
I scrape some websites for fun, using VBA as my tool. I use XMLHTTP and HTMLDocument because they are faster than InternetExplorer.Application.
Public Sub XMLhtmlDocumentHTMLSourceScraper()
    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument
    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"
    Set XMLHTTPReq = New MSXML2.XMLHTTP
    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With

    i = 0
    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")
    For Each vr In varTemp
        ''''the next line is important to solve this issue *1
        Cells(1, 1) = vr.outerHTML
        Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText
        ''''the next line occur 438Error''''
        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        Cells(i + 1, 4) = varTemp2.innerText
        i = i + 1
    Next vr
End Sub
I investigated the problem with the line marked *1. Cells(1, 1) shows the following:
<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank>
<DIV class=hover_inner><SPAN class=post_date>...............
Yes, all the class attributes have lost their double quotes (" "). Only the class of the first DIV still has them. I really don't know why this happens.
// I could parse with getElementsByTagName("span") instead, but I would prefer to select by class.
Here, getElementsByClassName is not treated as a method of the element itself, only of the parent HTMLDocument. To use it to locate elements within a DIV element, create a sub-HTMLDocument made up of the .outerHTML of that specific DIV element.
Public Sub XMLhtmlDocumentHTMLSourceScraper()
    'early binding: needs references to Microsoft XML and Microsoft HTML Object Library
    Dim xmlHTTPReq As New MSXML2.XMLHTTP
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
    Dim iDIV As Long, iSPN As Long, iEL As Long
    Dim postURL As String, nr As Long, i As Long

    postURL = "http://foodffs.tumblr.com/archive/2015/11"
    With xmlHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    'Set htmlDOC = New HTMLDocument
    With htmlDOC
        .body.innerHTML = xmlHTTPReq.responseText
    End With

    i = 0
    With htmlDOC
        For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
            'next empty row in column C; writes below target the same sheet
            nr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Offset(1, 0).Row
            With .getElementsByClassName("post_glass post_micro_glass")(iDIV)
                'method 1 - run through multiples in a collection
                For iSPN = 0 To .getElementsByTagName("span").Length - 1
                    With .getElementsByTagName("span")(iSPN)
                        Select Case LCase(.className)
                            Case "post_date"
                                Sheet1.Cells(nr, 3) = .innerText
                            Case "post_notes"
                                Sheet1.Cells(nr, 4) = .innerText
                            Case Else
                                'do nothing
                        End Select
                    End With
                Next iSPN
                'method 2 - create a sub-HTML doc to facilitate getting els by classname
                divSUBDOC.body.innerHTML = .outerHTML 'only the HTML from this DIV
                With divSUBDOC
                    If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                        'use the first
                        Sheet1.Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                    End If
                End With
            End With
        Next iDIV
    End With
End Sub
While the other .getElementsByXXXX methods can readily retrieve collections from within another element, getElementsByClassName needs to work on what it believes is a whole HTMLDocument, even if you have only fooled it into thinking so.
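If the sub-document trick is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch under stated assumptions: FirstTextByClass is a hypothetical name that does not appear in the answer above, and the code assumes the same "Microsoft HTML Object Library" reference that the answer's early-bound HTMLDocument relies on.

```vba
' Hypothetical helper (not part of the original answer).
' Assumes a reference to "Microsoft HTML Object Library".
' Returns the innerText of the first element with the given class found in
' parentEl's markup, or an empty string when nothing matches.
' parentEl's own tag is part of its outerHTML, so it can match too.
Public Function FirstTextByClass(ByVal parentEl As Object, _
                                 ByVal className As String) As String
    Dim subDoc As New HTMLDocument
    Dim matches As Object

    ' Load only this element's markup into a throwaway document
    subDoc.body.innerHTML = parentEl.outerHTML

    ' Search by class on the document, as in method 2 above
    Set matches = subDoc.getElementsByClassName(className)
    If matches.Length > 0 Then
        FirstTextByClass = matches.Item(0).innerText
    End If
End Function
```

Inside the loop, method 2 would then shrink to a single call such as Sheet1.Cells(nr, 5) = FirstTextByClass(htmlDOC.getElementsByClassName("post_glass post_micro_glass")(iDIV), "hover_inner"). Creating a new HTMLDocument on every call is simple but presumably slower than reusing one divSUBDOC as the answer does.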