Problem description
I have a PHP DOM web scraper that works fine. It extracts the specified tags, along with their links, from an external forum site onto my page.
But recently I ran into a problem.
For example, this is the HTML of the forum data:
<tbody>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>
Now suppose the table data above is the only markup on that site, and I try to extract it with a scraper like this:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach($html->find('td.FootNotes2') as $element) {
echo $element;
}
?>
It extracts all the data inside the cells whose class name is "FootNotes2".
Now, what if I want to extract specific data from within a tag, for example the names "dreamer1984" and "monariyadh" from the first tag in each row?
And what if I want to extract data only from the third one (skipping the rest), given that they all share the same class name?
I hope I have made the problem clear.
Any help is appreciated.
I suggest you use a regular expression.
Here is an example of what you need:
$subject = <<<EOF
<tbody>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>
EOF;
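// Capture the poster name (the text after "</a> - ") and the date in the next FootNotes2 cell.
// Flags: s lets "." match newlines, i ignores case, u treats the pattern as UTF-8.
preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matches);
foreach ($matches['name'] as $k => $v) {
    var_dump('name: ' . $v, 'relative date: ' . $matches['date'][$k]);
}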
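If you would rather stay with DOM parsing than regex, the same data can probably be pulled out with PHP's built-in DOMDocument and DOMXPath. The following is only a sketch, not the code from the answer above: the function name extractForumRows and the array keys are made up for illustration, the sample is trimmed from the question's markup and wrapped in a <table> so the parser keeps the rows, and "third" is assumed to mean the third cell with class FootNotes2 in each row.

```php
<?php
// Sample rows trimmed from the question's markup (wrapped in <table> for the parser).
$sample = <<<HTML
<table><tbody>
<tr>
<td width="1%" height="25"> </td>
<td class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
<td class="FootNotes2" align="center">02/28/17 01:42</td>
<td align="Center" class="FootNotes2">0</td>
<td align="Center" class="FootNotes2">200</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
<td class="FootNotes2" align="center">02/27/17 23:12</td>
<td align="Center" class="FootNotes2">0</td>
<td align="Center" class="FootNotes2">108</td>
</tr>
</tbody></table>
HTML;

/**
 * Hypothetical helper: returns one entry per table row with the thread title,
 * the poster name, and the text of the third FootNotes2 cell.
 */
function extractForumRows(string $html): array
{
    $doc = new DOMDocument();
    // Silence warnings that libxml raises for legacy or sloppy HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        // Only cells whose class attribute is exactly "FootNotes2"; spacer cells are skipped.
        $cells = $xpath->query('./td[@class="FootNotes2"]', $tr);
        if ($cells->length < 3) {
            continue;
        }
        $link = $xpath->query('./a', $cells->item(0))->item(0);
        if ($link === null || $link->nextSibling === null) {
            continue;
        }
        // The text node after the link looks like " - dreamer1984"; drop the leading dash.
        $name = preg_replace('/^\s*-\s*/', '', $link->nextSibling->textContent);
        $rows[] = [
            'title' => trim($link->textContent),
            'name'  => trim((string) $name),
            'third' => trim($cells->item(2)->textContent),
        ];
    }
    return $rows;
}

foreach (extractForumRows($sample) as $row) {
    echo $row['name'], ' | ', $row['third'], PHP_EOL;
}
```

To run this against the live page, you could fetch the HTML yourself (for example with file_get_contents) and pass the string to the helper. The exact class match in the XPath expression is deliberately simple and would need loosening if a cell ever carries more than one class.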