How to parse only part of an HTML file and ignore the rest


Question

In each of 5,000 HTML files I need just one line of text: line 999. How can I tell HTML::Parser that I only need line 999?

</p><h1>dataset 1:</h1>

&nbsp;<table border="0" bgcolor="#EFEFEF"  leftmargin="15" topmargin="5"><tr>
<td><strong>name:</strong>&nbsp;</td>  <td width=500> myname one         </td></tr><tr>
<td><strong>type:</strong>&nbsp;</td>  <td width=500>       type_one  (04313488)        </td></tr><tr>
<td><strong>aresss:</strong>&nbsp;</td><td>Friedrichstr. 70,&nbsp;73430&nbsp;Madrid</td></tr><tr>
<td><strong>adresse_two:</strong>&nbsp;</td>  <td>          no_value        </td></tr><tr>
<td><strong>telefone:</strong>&nbsp;</td>  <td>         0000736111/680040        </td></tr><tr>
<td><strong>Fax:</strong>&nbsp;</td>  <td>          0000736111/680040        </td></tr><tr>
<td><strong>E-Mail:</strong>&nbsp;</td>  <td>       Keine Angabe        </td></tr><tr>
<td><strong>Internet:</strong>&nbsp;</td><td><a href="http://www.mysite.es" target="_blank">www.mysite.es</a><br></td></tr><tr> <td><strong>the office:</strong>&nbsp;</td>
<td><a href="http://www.mysite_two" target="_blank">mysite_two </a><br></td></tr><tr>
<td><strong>:</strong>&nbsp;</td><td> no_value </td></tr><tr>
<td><strong>officer:</strong>&nbsp;</td>  <td> no_value        </td>  </td></tr><tr>
<td><strong>employees:</strong>&nbsp;</td>  <td> 259        </td></tr><tr>
<td><strong>offices:</strong>&nbsp;</td>  <td>     8        </td></tr><tr>
<td><strong>worker:</strong>&nbsp;</td>  <td>     no_value        </td></tr><tr>
<td><strong>country:</strong>&nbsp;</td>  <td>    contryname        </td></tr><tr>
<td><strong>the_council:</strong>&nbsp;</td>  <td>

So the question is: can I search the 5,000 files knowing that only line 999 is of interest? In other words, can I tell the HTML parser to look at (and extract) exactly line 999?

Hello RedGrittyBrick, I have little experience with HTML::TokeParser. Here is what I have so far:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});

print $name->as_text;

By the way, RedGrittyBrick, see one of the example pages: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488 In the grey shaded block you can see the information I want: 17 lines. Note that I have 5,000 different HTML files, all structured in exactly the same way.

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

I would love to get a hint.

Recommended answer

Do you mean the 999th line or the 999th table row?

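The former could be done like this. The `close ARGV if eof` part resets the line counter `$.` at the end of each file; without it the counter keeps running across files and only the first file's line 999 would match:

perl -ne 'print if $. == 999; close ARGV if eof' /path/to/*.dat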
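If the one-liner becomes awkward for 5,000 files, the same idea can be written as a small script. This is only a sketch using core Perl: the helper name `line_of_file` and the tab-separated output format are illustrative choices, not something from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the given 1-based line of a file,
# or undef if the file has fewer lines than that.
sub line_of_file {
    my ($path, $wanted) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $found;
    while (my $line = <$fh>) {
        if ($. == $wanted) {
            chomp($found = $line);
            last;    # stop reading; the rest of the file is ignored
        }
    }
    close($fh);
    return $found;
}

# Print "file<TAB>line 999" for every file named on the command line.
for my $path (@ARGV) {
    my $line = line_of_file($path, 999);
    print "$path\t", (defined $line ? $line : '(fewer than 999 lines)'), "\n";
}
```

Saved under a name of your choice, it could be run with the files as arguments. Because each file gets its own filehandle, the line counter starts fresh for every file.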

The latter would involve an HTML parser and some selection logic. A SAX parser might be better for fast processing of a large number of files. It probably depends on which version of HTML is used and whether it is "well-formed".

Perl has many XML and HTML parsers. Did you have any particular module in mind?

Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The following expression works better:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

#
# replace this with a loop over 5000 existing files
#
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
my $html = get($url) or die "Could not fetch $url\n";

my $tree = HTML::TreeBuilder::XPath->new();
#
# within the loop process the html like this
#
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]');

Try cutting and pasting the above into a file, then run it with Perl.
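The comments in the script suggest replacing the download with a loop over the 5,000 existing files. A rough sketch of that loop might look like the following. It assumes the files live in a hypothetical `pages/` directory, that every page contains a table with a `bgcolor` attribute laid out like the sample in the question (label in the first cell, value in the second), and the helper name `extract_fields` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: collect label => value pairs from the rows of
# every table that carries a bgcolor attribute.
sub extract_fields {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my %field;
    for my $row ($tree->findnodes('//table[@bgcolor]/tr')) {
        my @cells = $row->findnodes('./td');
        next unless @cells >= 2;
        # Trim whitespace and non-breaking spaces; also drop the
        # trailing colon from the label.
        (my $label = $cells[0]->as_text) =~ s/^[\s\xA0]+|[\s:\xA0]+$//g;
        (my $value = $cells[1]->as_text) =~ s/^[\s\xA0]+|[\s\xA0]+$//g;
        $field{$label} = $value if length $label;
    }
    $tree->delete;    # free the tree before the next file
    return \%field;
}

# Assumed location of the saved pages; adjust to the real directory.
for my $file (glob 'pages/*.html') {
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close($fh);
    my $field = extract_fields($html);
    print "$file\t", ($field->{name} // ''), "\n";
}
```

Returning a hash per file keeps the parsing separate from whatever happens next, so the print statement could later be swapped for a DBI insert without touching the extraction code.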


08-31 04:11