Problem description
I have an HTML source file that I want to extract dates from. It is full of code and phrases I don't need; I only need every instance of a date that is wrapped in a specific HTML tag:
abbr title="((this is the text I need))" data-utime="
What's the easiest way to achieve this?
If you're using Excel VBA, set a reference (Tools - References) to the MSHTML library (listed as Microsoft HTML Object Library in the References dialog).
Sub ScrapeDateAbbr()

    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement
    Dim sFile As String, lFile As Long
    Dim sHtml As String

    'read in the file
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    Close lFile 'release the file handle once the text is in memory

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub
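If you'd rather not set the reference, a late-bound variant is one option. The following is only a sketch under a few assumptions: `AbbrTitles` and `DemoAbbrTitles` are hypothetical names that are not part of the code above, the sample HTML string is made up for illustration, and `CreateObject("htmlfile")` is assumed to return the same kind of MSHTML document that `New MSHTML.HTMLDocument` does. It returns the titles in a Collection instead of printing them, so the caller can decide what to do with them.

```vba
' Hypothetical helper (not part of the original answer): returns the title of
' every <abbr> element that carries a data-utime attribute.
Function AbbrTitles(ByVal sHtml As String) As Collection

    Dim hDoc As Object          ' late-bound MSHTML document, no reference needed
    Dim hElem As Object
    Dim vUtime As Variant
    Dim cTitles As Collection

    Set cTitles = New Collection
    Set hDoc = CreateObject("htmlfile")
    hDoc.body.innerHTML = sHtml

    For Each hElem In hDoc.getElementsByTagName("abbr")
        vUtime = hElem.getAttribute("data-utime")
        ' getAttribute returns Null when the attribute is absent
        If Not IsNull(vUtime) Then
            If Len(vUtime) > 0 Then cTitles.Add hElem.getAttribute("title")
        End If
    Next hElem

    Set AbbrTitles = cTitles

End Function

' Usage sketch with a made-up HTML fragment
Sub DemoAbbrTitles()

    Dim vTitle As Variant

    For Each vTitle In AbbrTitles("<abbr title=""Monday"" data-utime=""1"">x</abbr>")
        Debug.Print vTitle
    Next vTitle

End Sub
```

The explicit `IsNull` check spells out what the `Len(...) > 0` test relies on implicitly when the attribute is missing.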
I assumed the file was local, since you called it a source file. If you need to download it first, you'd also need a reference to the MSXML library and this code:
Sub ScrapeDateAbbrDownload()

    Dim xHttp As MSXML2.XMLHTTP
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement

    Set xHttp = New MSXML2.XMLHTTP
    xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
    xHttp.send

    'wait until the request has completed (readyState 4)
    Do
        DoEvents
    Loop Until xHttp.readyState = 4

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = xHttp.responseText

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub
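The wait loop above is needed because the request is asynchronous. One alternative, sketched here rather than taken from the code above, is to pass False as the third argument of `Open` so that `send` blocks until the response has arrived. `FetchHtml` is a hypothetical helper name, the object is created late-bound so no MSXML reference is required, and the status check assumes an HTTP(S) URL (a `file:///` URL may report a different status).

```vba
' Hypothetical helper (not part of the original answer): fetches a page
' synchronously and returns its text.
Function FetchHtml(ByVal sUrl As String) As String

    Dim xHttp As Object         ' late-bound, so no MSXML reference is required

    Set xHttp = CreateObject("MSXML2.XMLHTTP")

    ' Third argument False = synchronous: send returns only when the
    ' response is complete, so no readyState loop is needed
    xHttp.Open "GET", sUrl, False
    xHttp.send

    ' 200 is the usual HTTP success code; other schemes may report differently
    If xHttp.Status = 200 Then
        FetchHtml = xHttp.responseText
    Else
        Err.Raise vbObjectError + 513, "FetchHtml", _
                  "Request failed with status " & xHttp.Status
    End If

End Function
```

The returned string can then be assigned to `hDoc.body.innerHTML` exactly as `xHttp.responseText` is. The trade-off is that a synchronous call holds up Excel until the server answers, which the `DoEvents` loop avoids.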