I'm trying to parse an .html file that I have converted to a .txt file inside Automator.
I previously downloaded the .html file from a website using Automator, and am now struggling to parse the source code.
Preferably, I want to extract just the information in the table, and I need to repeat this for 1800 different .html files.
Here is an example of the source code:
<div id="header">
<div class="wrapper">
<span class="access">
<div id="fb-root"></div>
<span class="access">
Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a> Logged in as Edward | <a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>
</div><!-- /wrapper -->
</div><!-- /header -->
<div id="masthead">
<div class="wrapper">
<a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
<div id="navigation">
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li> <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>
</div><!-- /navigation -->
</div><!-- /wrapper -->
</div><!-- /masthead -->
<div id="content">
<div class="wrapper">
<div id="main-content">
<!-- per Project stuff -->
<span class="section">
<img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
<h1><span id="profile-name-104947" >Christian Sieling</span></h1>
<ul class="gbutton-group right">
<li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">« Back </a></li>
<li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752" id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
<div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
<span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
<a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
<h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
<p>Other</p> </td>
<th>Organisation Type</th>
<p>Asset Manager</p> </td>
<td><a href="mailto:cs@lumixcapital.com" title="cs@lumixcapital.com" >cs@lumixcapital.com</a></td>
<td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
<td>41 78 616 7334</td>
<th>Mailing Address</th>
<td>Birrenstrasse 30</td>
<th class="lastrow" >Zip/ Postal Code</th>
<td class="lastrow" >8834</td>
</div><!-- /main-content -->
<div id="sidebar" >
<div id="similar_sidebar" class="similar_refine" >
</div><!-- /wrapper -->
</div><!-- /content -->
<div id="footer">
My AppleScript attempt: I tried to use text item delimiters to extract the table, like this:
set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the table
to extractBetween(SearchText, startText, endText)
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to startText
set endItems to text of text item -1 of SearchText
set AppleScript's text item delimiters to endText
set beginningToEnd to text of text item 1 of endItems
set AppleScript's text item delimiters to tid
return beginningToEnd
end extractBetween
Any help would be amazing!
You're really close. The problem is your startText variable. The literal string "<table>" never appears in the HTML, because the opening table tag carries attributes, so that delimiter can't be found. The line that starts the table is actually...
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
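A quick way to see the mismatch for yourself is a sketch like the one below. The variable names (sampleTag, foundExact, foundPrefix, pieceCount) are just ones I made up for illustration, and the tag is shortened. When a delimiter isn't found, text items returns a single item holding the whole text, which is why your original handler returns everything before "</table>" rather than just the table.

```applescript
-- sampleTag is a shortened, hypothetical stand-in for the real opening tag
set sampleTag to "<table width=\"100%\" id=\"profile-table\">"

set foundExact to sampleTag contains "<table>" -- false: a space follows "<table"
set foundPrefix to sampleTag contains "<table" -- true

-- Splitting on a delimiter that is absent yields one item: the whole text
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to "<table>"
set pieceCount to count of text items of sampleTag -- 1
set AppleScript's text item delimiters to tid
```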
So I modified your code to look for that tag in two steps. First the start of the tag, "<table",
and then, separately, the ">" that closes it.
This way we can ignore all of the attributes that come with the table tag (width, border, etc.), because I assume they will vary between the files. After doing this we get only the contents of the table. Try this...
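set p to input
set ex to extractBetween(p, "<table", ">", "</table>")

-- Returns the text between the last opening tag that begins with startText1 and the next endText
to extractBetween(SearchText, startText1, startText2, endText)
	set tid to AppleScript's text item delimiters -- remember the current delimiters
	set AppleScript's text item delimiters to startText1
	set endItems to text item -1 of SearchText -- everything after the last "<table"
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text item 1 of endItems -- drop "</table>" and everything after it
	set AppleScript's text item delimiters to startText2
	-- drop the tag attributes before the first ">" and rejoin the rest with ">"
	set finalText to (text items 2 thru -1 of beginningToEnd) as text
	set AppleScript's text item delimiters to tid -- restore the original delimiters
	return finalText
end extractBetween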
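Since you need to do this for 1800 files, here is a rough sketch of how the handler might be wrapped in an Automator "Run AppleScript" action. It rests on a few assumptions: that input arrives as a list, that each item is already the text of one HTML file (for example from an earlier action that reads the files), and that skipping files without a table is what you want. The name tableList is mine, not part of Automator.

```applescript
on run {input, parameters}
	set tableList to {} -- hypothetical name: collects one table per file
	repeat with anItem in input
		-- assumption: each input item is the full text of one HTML file
		set htmlText to (contents of anItem) as text
		-- skip items that have no table, so the handler never sees bad input
		if htmlText contains "<table" and htmlText contains "</table>" then
			set end of tableList to extractBetween(htmlText, "<table", ">", "</table>")
		end if
	end repeat
	return tableList
end run

-- same handler as above, repeated so this script is self-contained
to extractBetween(SearchText, startText1, startText2, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText1
	set endItems to text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text item 1 of endItems
	set AppleScript's text item delimiters to startText2
	set finalText to (text items 2 thru -1 of beginningToEnd) as text
	set AppleScript's text item delimiters to tid
	return finalText
end extractBetween
```

The result is a list with one entry per file that contained a table, which a following action could write out or process further.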