
Problem description

I'm trying to use regular expressions to remove certain blocks of coding from a text file. So far, most of my regular expression lines have worked to remove the codes. However, I have two questions:

1) Whenever I remove a chunk of text, the place where the text used to be is filled with blank space, rather than the text simply being removed. An example of my regex code is:

$file =~ s/<ul(.*)>//gi;

Which removes all lines with the basic format <ul...>, which is what I want it to do. However, as mentioned prior, it replaces the tag and all contained data with blank spaces, and I was wondering how to stop this particular substitution.

2) Certain regular expression codes that should work, don't seem to. For instance, I want to remove

<script type="text/javascript">

function getCookies() { return ""; }

</script>

I have tried using various regex codes, but nothing seems to remove these lines. For instance:

$file =~ s/<script type(.*)<\/script>//gi;

Which removes the <script type...> and </script> tags respectively, but leaves the

function getCookies() { return ""; }

...intact. I'm unsure as to why this happens, and I would very much like to correct this. How would this be possible? Any help on either of these two questions would be immensely helpful!

Edit: Sorry all, I'm using Perl! Also: I just tried using

$file =~ /<script type(.*)<\/script>/sgi

...as well as /msgi, but neither worked unfortunately. Both the <script type> and </script> tags were removed, but for some reason the

function getCookies() { return ""; }

...section stayed. Here is my entire code, including all regex:

use strict;
use warnings;

my $firstarg;
if ($ARGV[0]){
  $firstarg = $ARGV[0];
}

open (DATA, $ARGV[1]);
my $file = do {local $/; <DATA>};

$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
$file =~ s/<head>//gi;
$file =~ s/<\/head>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<\link>//gi;
$file =~ s/CDM(.*)\;//gi;
$file =~ s/<\!(.*)->//gi;
$file =~ s/<body(.*)>//gi;
$file =~ s/<\/body>//gi;
$file =~ s/<div(.*)>//gi;
$file =~ s/<\/div>//gi;
$file =~ s/function(.*)>//gi;
$file =~ s/<noscript>//gi;
$file =~ s/<\/noscript>//gi;
$file =~ s/<a(.*)>//gi;
$file =~ s/<\/a>//gi;
$file =~ s/<ul(.*)>//gi;
$file =~ s/<\/ul>//gi;
$file =~ s/<li(.*)>//gi;
$file =~ s/<\/li>//gi;
$file =~ s/<form(.*)>//gi;
$file =~ s/<\/form>//gi;
$file =~ s/<iframe(.*)>//gi;
$file =~ s/<\/iframe>//gi;
$file =~ s/<select(.*)>//gi;
$file =~ s/<\/select>//gi;
$file =~ s/<textarea(.*)>//gi;
$file =~ s/<\/textarea>//gi;
$file =~ s/<b>//gi;
$file =~ s/<\/b>//gi;
$file =~ s/<H1>//gi;
$file =~ s/<H2>//gi;
$file =~ s/<H3>//gi;
$file =~ s/<H4>//gi;
$file =~ s/<H5>//gi;
$file =~ s/<H6>//gi;
$file =~ s/<\/H1>//gi;
$file =~ s/<\/H2>//gi;
$file =~ s/<\/H3>//gi;
$file =~ s/<\/H4>//gi;
$file =~ s/<\/H5>//gi;
$file =~ s/<\/H6>//gi;
$file =~ s/<option(.*)>//gi;
$file =~ s/<\/option>//gi;
$file =~ s/<p>//gi;
$file =~ s/<\/p>//gi;
$file =~ s/<span(.*)>//gi;
$file =~ s/<\/span>//gi;
$file =~ s/<!doctype(.*)>//gi;
$file =~ s/<base(.*)>//gi;
$file =~ s/<br>//gi;
$file =~ s/<hr>//gi;
$file =~ s/<img(.*)>//gi;
$file =~ s/<input(.*)>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<meta(.*)>//gi;
$file =~ s/<script type(.*)<\/script>//gi;
print $file;

Ok, now that I deleted the <script> regex that was causing one problem, another has been created - using:

$file =~ s/<script type(.*)<\/script>//gi;

removes everything in between the first instance of <script ...>, but not the tag itself, nor the repetitions of the tag throughout. Using:

$file =~ s/<script type(.*)<\/script>//mgi;

results in the exact same thing. Using:

$file =~ s/<script type(.*)<\/script>//sgi;

results in the printing of several newline characters, but no other text; the same goes for /msgi. Urgh, the problems never end... :(

NEW EDIT: I would like to apologize for posting a question about parsing HTML using regex. I realize that there is a rather large backlash within the programming community regarding this practice (or attempt at practice, since it seems to fail more often than not). However, I am unfortunately required to use regex to process selected HTML, in a way that removes the majority, if not all, of the HTML tags. I am not allowed to use a module, despite that being the most obvious and simplest answer.

Solution

If you are not allowed to use anything but Perl regular expressions then you could adapt the code to strip HTML tags from a text:

#!/usr/bin/perl -w
use strict;
use warnings;

$_ = do { local $/; <DATA> };

# see http://www.perlmonks.org/?node_id=161281
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags which require correspond
#           end tag plus all to end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
s{
  <               # open tag
  (?:             # open group (A)
    (!--) |       #   comment (1) or
    (\?) |        #   another comment (2) or
    (?i:          #   open group (B) for /i
      (           #     one of start tags
        SCRIPT |  #     for which
        APPLET |  #     must be skipped
        OBJECT |  #     all content
        STYLE     #     to correspond
      )           #     end tag (3)
    ) |           #   close group (B), or
    ([!/A-Za-z])  #   one of these chars, remember in (4)
  )               # close group (A)
  (?(4)           # if previous case is (4)
    (?:           #   open group (C)
      (?!         #     and next is not : (D)
        [\s=]     #       \s or "="
        ["`']     #       with open quotes
      )           #     close (D)
      [^>] |      #     and not close tag or
      [\s=]       #     \s or "=" with
      `[^`]*` |   #     something in quotes ` or
      [\s=]       #     \s or "=" with
      '[^']*' |   #     something in quotes ' or
      [\s=]       #     \s or "=" with
      "[^"]*"     #     something in quotes "
    )*            #   repeat (C) 0 or more times
  |               # else (if previous case is not (4))
    .*?           #   minimum of any chars
  )               # end if previous char is (4)
  (?(1)           # if comment (1)
    (?<=--)       #   wait for "--"
  )               # end if comment (1)
  (?(2)           # if another comment (2)
    (?<=\?)       #   wait for "?"
  )               # end if another comment (2)
  (?(3)           # if one of tags-containers (3)
    </            #   wait for end
    (?i:\3)       #   of this tag
    (?:\s[^>]*)?  #   skip junk to ">"
  )               # end if (3)
  >               # tag closed
 }{}gsx;         # STRIP THIS TAG

print;

__END__
<html><title>remove script, ul</title>
<script type="text/javascript">

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

remove script, ul


1
2
paragraph

NOTE: This regex doesn't work for nested tag-containers, e.g.:

<!DOCTYPE html>
<meta charset="UTF-8">
<title>Nested &lt;object> example</title>
<body>
<object data="uri:here">fallback content for uri:here
  <object data="uri:another">uri:another fallback
  </object>!!!this text should be striped too!!!
</object>

Output

Nested &lt;object> example

!!!this text should be striped too!!!
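
If nested containers have to be handled with a regex alone, one possible approach is a recursive pattern. The following is only a rough sketch, not part of the code above: it assumes Perl 5.10 or later (for the `(?1)` recursion construct), handles only `<object>`, and the helper name `strip_nested_objects` is made up for illustration. It does not try to cope with comments or with `>` inside quoted attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: remove <object> elements together with any
# <object> elements nested inside them.
sub strip_nested_objects {
    my ($html) = @_;
    $html =~ s{
        (                          # group 1: one balanced object element
          <object\b[^>]*>          #   opening tag
          (?:
              (?1)                 #   a nested, balanced object element
            | (?! </?object\b ) .  #   or any char that does not start an object tag
          )*
          </object\s*>             #   the matching closing tag
        )
    }{}gxis;
    return $html;
}

my $sample = <<'HTML';
<!DOCTYPE html>
<meta charset="UTF-8">
<title>Nested &lt;object> example</title>
<body>
<object data="uri:here">fallback content for uri:here
  <object data="uri:another">uri:another fallback
  </object>!!!this text should be striped too!!!
</object>
HTML

print strip_nested_objects($sample);
```

With this input the whole outer `<object>` element, including the text after the inner closing tag, is removed. The other tags are left alone, since this sketch only targets `<object>`.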


Don't parse HTML with regexes. Use an HTML parser or a tool built on top of one, e.g., HTML::Parser:

#!/usr/bin/perl -w
use strict;
use warnings;

use HTML::Parser ();

HTML::Parser->new(
    ignore_elements => ["script"],
    ignore_tags => ["ul"],
    default_h => [ sub { print shift }, 'text'],
    )->parse_file(\*DATA) or die "error: $!\n";

__END__
<html><title>remove script, ul</title>
<script type="text/javascript">

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

<html><title>remove script, ul</title>

<body>
<li>1
<li>2
<p>paragraph
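
As for why the patterns in the question behaved the way they did, there are three likely causes. Without the /s modifier, `.` does not match a newline, so `(.*)` cannot span a multi-line `<script>` block. With /s, the greedy `.*` runs from the first `<script` to the last `</script>` in the file. And removing only the tag leaves the line break after it, which shows up as blank space. Below is a minimal sketch of a substitution that addresses all three. The helper name `strip_scripts` and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every <script ...>...</script> block,
# including the single line break that follows it.
sub strip_scripts {
    my ($html) = @_;
    # /s   lets "." match newlines, so a match can span several lines.
    # .*?  is non-greedy, so each match stops at the nearest </script>.
    # \n?  also consumes one trailing newline, so no blank line is left.
    $html =~ s{<script\b[^>]*>.*?</script\s*>\n?}{}gis;
    return $html;
}

my $sample = <<'HTML';
<p>before</p>
<script type="text/javascript">

function getCookies() { return ""; }

</script>
<p>between</p>
<script type="text/javascript">var x = 1;</script>
<p>after</p>
HTML

print strip_scripts($sample);
```

Here only the three `<p>` lines are printed. The same caveats as above apply: this is still regex-based tag stripping and can be fooled by, for example, a literal `</script>` inside a JavaScript string.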
