问题描述
我当前的项目包括从HTML源代码中检索数据.具体来说,我正在此网站上查看崩溃案例:
My current project consists of retrieving data from HTML source code.Specifically, I am looking at crash cases on this website:
https://crashviewer .nhtsa.dot.gov/nass-cds/CaseForm.aspx?xsl = main.xsl& CaseID = 112007272
我想通过查找特定标签/ID的.innertext
来从HTML收集所有相关数据.
I would like to gather all relevant data from the HTML by looking for .innertext
of specific tags/IDs.
到目前为止,我的代码:
My code so far:
Sub ExtractData()
mystart:
'First I create two Internet Explorer object
Set objIE = CreateObject("InternetExplorer.Application") 'this browser contains the list of cases
objIE.Top = 0
objIE.Left = 0
objIE.Width = 1600
objIE.Height = 900
objIE.Visible = True 'We can see IE
Set objIEdata = CreateObject("InternetExplorer.Application") 'this browser opens the specific case
objIEdata.Top = 0
objIEdata.Left = 0
objIEdata.Width = 1600
objIEdata.Height = 900
objIEdata.Visible = True 'We can see IE
On Error Resume Next
objIE.navigate ("https://crashviewer.nhtsa.dot.gov/LegacyCDS/Index") 'url of website
Do
DoEvents
If Err.Number <> 0 Then
objIE.Quit
Set objIE = Nothing
GoTo mystart:
End If
Loop Until objIE.readystate = 4
'we define an object variable Alllinks and loop through all the links to search for
Set aAlllinks = objIE.document.getElementsByTagName("button") 'looks for Search Button
For Each Hyperlink In aAlllinks
If Hyperlink.innertext = " Search" Then
Hyperlink.Click
Exit For
Else
MsgBox "Search Button was not found. Please improve code!"
End If
Next
Application.Wait (Now + TimeValue("0:00:02"))
Set bAlllinks = objIE.document.getElementsByTagName("a") 'all Hyperlinks on webpage start with Tag "a"
For Each Hyperlink In bAlllinks
If UBound(Split(Hyperlink.innertext, "-")) = 2 And Len(Hyperlink.innertext) = 11 Then 'case specific to find the Hyperlinks which contain cases
Debug.Print Hyperlink.innertext
'2nd IE is used for each case
restart:
objIEdata.navigate (Hyperlink.href) 'url of each case
Do
DoEvents
If Err.Number <> 0 Then
objIEdata.Quit
Set objIE = Nothing
GoTo restart:
End If
Loop Until objIEdata.readystate = 4
Set register = objIEdata.document.getElementByTagName("tbody") 'objIEdata.document.getElementByID("main").getElementByID("mainSection") '.getElementByID("bodyMain").getElementsByTagName("tbody")
For Each untermenue In register
Debug.Print untermenue.innerHTML
Next
Application.Wait (Now + TimeValue("0:00:02"))
End If
Next
objIE.Quit
objIEdata.Quit
End Sub
请注意,IE的可见性仅出于调试原因.
Note that the visibility of IE is just for debugging reasons.
让我感到困惑的部分是
Set register = objIEdata.document.getElementByTagName("tbody")
.
如果我查找.TagName("tbody")
,则变量寄存器将返回空,如果我查找.ID("bodyMain")
,也会发生相同的情况.不幸的是,我不熟悉HTML以及VBA如何与HTML文档进行交互.我的印象是,如果碰巧有一个元素,我可以按它们的ID处理所有元素,但这似乎行不通.
If I look for .TagName("tbody")
the variable register is returned empty and the same happens if I look for .ID("bodyMain")
. Unfortunately, I am not familiar with HTML and how VBA interacts with a HTML document. I was under the impression that I could address all elements by their ID, if they happen to have one, but this does not seem to work.
我是否需要遍历HTML分支",还是应该找到每个ID(无论在哪个子"中找到代码)?
Do I need to work myself through the HTML "branches" or should the code be able to find each ID, no matter the "child" it is to be found in?
非常感谢
推荐答案
您要问的是一个很大的请求,因此我将提供一些指针和起始代码.我的代码应该写出所有表,但是您将需要尝试获得所需的格式.围绕有效选择元素肯定有足够的逻辑可以对此有所帮助. *我还没有测试过使用类遍历所有检索到的id的时间限制,但是已经测试了个别情况和所有id的检索.
What you are asking is a pretty big request so I am going to give some pointers and starting code. My code should write out all the tables but you will want to play around to get your desired format. There is certainly enough logic around selecting elements efficiently that this should help. * I haven't tested the use of the class to loop over all retrieved ids due time contraints but have tested the individual case and the retrieval of all ids.
要获取初始案例链接和ID:
我可能会使用一个函数来返回包含链接和ID的数组.如果您提取ID,则可以将它们传递给XMLHTTP请求,如下所示.
I might use a function returning an array containing the links and ids. If you extract the ids they can be passed the XMLHTTP request I show below.
URL为 https://crashviewer.nhtsa.dot.gov/LegacyCDS/Search
Public Function GetLinksAndIds(ByVal URL) As Variant
Dim ie As InternetExplorer, i As Long
Set ie = New InternetExplorer
With ie
.Visible = True
.navigate2 URL
While .Busy Or .readyState < 4: DoEvents: Wend
.document.getElementById("btnSubmit1").Click
While .Busy Or .readyState < 4: DoEvents: Wend
Dim caseLinks As Object, id As String, newURL As String
Set caseLinks = .document.querySelectorAll("[href*='CaseID=']")
Dim linksAndIds()
ReDim linksAndIds(1 To caseLinks.Length, 1 To 2)
For i = 0 To caseLinks.Length - 1
linksAndIds(i + 1, 1) = caseLinks.item(i)
linksAndIds(i + 1, 2) = Replace$(caseLinks.item(i), "https://crashviewer.nhtsa.dot.gov/nass-cds/CaseForm.aspx?xsl=main.xsl&CaseID=", vbNullString)
Next
.Quit
End With
GetLinksAndIds = linksAndIds
End Function
示例返回值:
对于每种情况-使用XMLHTTP:
我很想避免IE,并使用 XMLHTTP
请求( url编码的查询字符串,使用print选项返回更具可读性的页面版本).尽管我使用css选择器进行了解析,但是您可以将响应读入MSXML2.DOMDocument60
并使用XPath
查询.您可以将caseid连接到URL.
I would be tempted to avoid IE and use XMLHTTP
request (url encoded query string returning more readable page version using the print option). Although I have parsed using css selectors you can read the response into an MSXML2.DOMDocument60
and query with XPath
for example. You can concatenate caseid into URL.
Option Explicit
Public Sub GetTables()
Dim sResponse As String, html As HTMLDocument, clipboard As Object, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://crashviewer.nhtsa.dot.gov/nass-cds/CaseForm.aspx?ViewPage&xsl=Case.xsl&tab=Crash&form=CaseForm&baseNode=&vehnum=-1&occnum=-1&pos=-1&pos2=-1&websrc=true&title=Crash%20Overview%20-%20Summary&caseid=112007272&year=&fullimage=false", False '<==concatenate caseid into URL
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = .responseText
End With
Set html = New HTMLDocument
html.body.innerHTML = sResponse
Dim tables As Object, i As Long
Set tables = html.querySelectorAll("table")
For i = 0 To tables.Length - 1
clipboard.SetText tables.item(i).outerHTML
clipboard.PutInClipboard
ws.Cells(LastRow(ws) + 2, 1).PasteSpecial
Next
End Sub
'https://www.rondebruin.nl/win/s9/win005.htm '<< Function below modified from here
Public Function LastRow(ByVal sh As Worksheet) As Long
On Error Resume Next
LastRow = sh.Cells.Find(What:="*", _
After:=sh.Range("A1"), _
Lookat:=xlPart, _
LookIn:=xlFormulas, _
SearchOrder:=xlByRows, _
SearchDirection:=xlPrevious, _
MatchCase:=False).Row
On Error GoTo 0
End Function
使用类来保存xmlhttp对象,这看起来完全是(未经测试):
clsHTTP类:
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetString(ByVal URL As String) As String
Dim sResponse As String
With http
.Open "GET", URL, False
.send
sResponse = .responseText
End With
End Function
标准模块1:
Option Explicit
Public Sub GetTables()
Dim sResponse As String, html As HTMLDocument, clipboard As Object, ws As Worksheet
Dim initialLinksURL As String, http As clsHTTP, i As Long, j As Long, newURL As String
Set http = New clsHTTP
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
Set html = New HTMLDocument
initialLinksURL = "https://crashviewer.nhtsa.dot.gov/LegacyCDS/Search"
Dim linksAndIds()
linksAndIds = GetLinksAndIds(initialLinksURL)
For i = LBound(linksAndIds, 2) To UBound(linksAndIds, 2)
newURL = "https://crashviewer.nhtsa.dot.gov/nass-cds/CaseForm.aspx?ViewPage&xsl=Case.xsl&tab=Crash&form=CaseForm&baseNode=&vehnum=-1&occnum=-1&pos=-1&pos2=-1&websrc=true&title=Crash%20Overview%20-%20Summary&caseid=" & linksAndIds(i, 2) & "&year=&fullimage=false"
html.body.innerHTML = http.GetString(newURL)
Dim tables As Object
Set tables = html.querySelectorAll("table")
For j = 0 To tables.Length - 1
clipboard.SetText tables.item(j).outerHTML
clipboard.PutInClipboard
ws.Cells(LastRow(ws) + 2, 1).PasteSpecial
Next
Next
End Sub
'https://www.rondebruin.nl/win/s9/win005.htm
Public Function LastRow(ByVal sh As Worksheet) As Long
On Error Resume Next
LastRow = sh.Cells.Find(What:="*", _
After:=sh.Range("A1"), _
Lookat:=xlPart, _
LookIn:=xlFormulas, _
SearchOrder:=xlByRows, _
SearchDirection:=xlPrevious, _
MatchCase:=False).Row
On Error GoTo 0
End Function
Public Function GetLinksAndIds(ByVal URL) As Variant
Dim ie As InternetExplorer, i As Long
Set ie = New InternetExplorer
With ie
.Visible = True
.navigate URL
While .Busy Or .readyState < 4: DoEvents: Wend
.document.getElementById("btnSubmit1").Click
While .Busy Or .readyState < 4: DoEvents: Wend
Dim caseLinks As Object, id As String, newURL As String
Set caseLinks = .document.querySelectorAll("[href*='CaseID=']")
Dim linksAndIds()
ReDim linksAndIds(1 To caseLinks.Length, 1 To 2)
For i = 0 To caseLinks.Length - 1
linksAndIds(i + 1, 1) = caseLinks.item(i)
linksAndIds(i + 1, 2) = Replace$(caseLinks.item(i), "https://crashviewer.nhtsa.dot.gov/nass-cds/CaseForm.aspx?xsl=main.xsl&CaseID=", vbNullString)
Next
.Quit
End With
GetLinksAndIds = linksAndIds
End Function
所有Internet Explorer选项:
All Internet Explorer option:
Option Explicit
Public Sub GetTables()
Dim sResponse As String, html As HTMLDocument, clipboard As Object, ws As Worksheet
Dim initialLinksURL As String, i As Long, j As Long, newURL As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
Set html = New HTMLDocument
initialLinksURL = "https://crashviewer.nhtsa.dot.gov/LegacyCDS/Search"
Dim ie As InternetExplorer, caseLinks As Object
Set ie = New InternetExplorer
With ie
.Visible = True
.Navigate2 initialLinksURL
While .Busy Or .readyState < 4: DoEvents: Wend
.document.getElementById("btnSubmit1").Click
While .Busy Or .readyState < 4: DoEvents: Wend
Set caseLinks = .document.querySelectorAll("[href*='CaseID=']")
Dim linksAndIds()
ReDim linksAndIds(1 To caseLinks.Length, 1 To 2)
For i = 0 To caseLinks.Length - 1
linksAndIds(i + 1, 1) = caseLinks.item(i)
linksAndIds(i + 1, 2) = Replace$(caseLinks.item(i), "https://crashviewer.nhtsa.dot.gov/nass-cds/CaseForm.aspx?xsl=main.xsl&CaseID=", vbNullString)
Next
For i = LBound(linksAndIds, 2) To 2 ' UBound(linksAndIds, 2)
newURL = "https://crashviewer.nhtsa.dot.gov/nass-cds/CaseForm.aspx?ViewPage&xsl=Case.xsl&tab=Crash&form=CaseForm&baseNode=&vehnum=-1&occnum=-1&pos=-1&pos2=-1&websrc=true&title=Crash%20Overview%20-%20Summary&caseid=" & linksAndIds(i, 2) & "&year=&fullimage=false"
.Navigate2 newURL
While .Busy Or .readyState < 4: DoEvents: Wend
Dim tables As Object
Set tables = .document.querySelectorAll("table")
For j = 0 To tables.Length - 1
clipboard.SetText tables.item(j).outerHTML
clipboard.PutInClipboard
ws.Cells(LastRow(ws) + 2, 1).PasteSpecial
Next
Next
.Quit
End With
End Sub
'https://www.rondebruin.nl/win/s9/win005.htm
Public Function LastRow(ByVal sh As Worksheet) As Long
On Error Resume Next
LastRow = sh.Cells.Find(What:="*", _
After:=sh.Range("A1"), _
Lookat:=xlPart, _
LookIn:=xlFormulas, _
SearchOrder:=xlByRows, _
SearchDirection:=xlPrevious, _
MatchCase:=False).Row
On Error GoTo 0
End Function
这篇关于VBA无法使用.getElementsByTag()或.getElementByID()从HTML获取数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!