Minifying the final HTML output using regular expressions and CodeIgniter


Problem description

Google's pages suggest minifying the HTML, that is, removing all unnecessary whitespace. CodeIgniter does have a feature for gzipping output, or that can be done via .htaccess. Even so, I would also like to remove unnecessary whitespace from the final HTML output.
I played a bit with this piece of code, and it seems to work. It does produce HTML without excess spaces, and it also removes tab formatting.
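class Welcome extends CI_Controller
{
    // CodeIgniter hands the finalized output to _output() when it is defined.
    function _output($output)
    {
        // Collapse every run of whitespace into a single space.
        echo preg_replace('!\s+!', ' ', $output);
    }

    function index()
    {
    ...
    }
}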
The problem is that there may be tags like <pre>, <textarea>, etc., which may contain significant whitespace that the regular expression should leave untouched. So, how do I remove excess whitespace from the final HTML with a regular expression, without affecting the spaces or formatting inside these particular tags?
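To make the problem concrete, here is a minimal sketch of what the naive replacement does to preformatted text. The function name `naive_collapse` and the sample markup are illustrative assumptions, not part of the original code.

```php
<?php
// Hypothetical helper: collapse every whitespace run to a single space,
// exactly as the naive '!\s+!' replacement does.
function naive_collapse(string $html): string
{
    return preg_replace('!\s+!', ' ', $html);
}

$html = "<p>Hello   world</p>\n<pre>line 1\n    line 2</pre>";

// The newline and indentation inside <pre> are collapsed as well,
// so the preformatted block loses its layout.
echo naive_collapse($html), "\n";
```

Any approach that should keep `<pre>` or `<textarea>` content intact therefore needs some way to tell whether a given whitespace run sits inside one of those elements.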
Thanks to @Alan Moore, I got the answer. This worked for me:
echo preg_replace('#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#', ' ', $output);
ridgerunner did a very good job of analyzing this regular expression. I ended up using his solution. Cheers to ridgerunner.
Solution
For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so it can be read by mere mortals:
function process_data_alan($text)
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}
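As a quick check of the behaviour described in the comments, the following sketch applies a compact, single-line form of the same pattern to a small sample. The function name `collapse_outside_blacklist` and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: collapse whitespace runs except inside <pre> and
// <textarea>, using a compact form of the pattern discussed above.
function collapse_outside_blacklist(string $html): ?string
{
    $re = '%(?>[^\S ]\s*|\s{2,})'                      // whitespace run to collapse
        . '(?=(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+'   // scan ahead over other tags
        . '(?:<(?>textarea|pre)\b|\z))%i';              // next: blacklist open tag or end
    return preg_replace($re, ' ', $html);
}

$html = "<p>a   b</p>\n\n<pre>x\n   y</pre>\n<p>c</p>";

// Whitespace between and inside the <p> elements is collapsed, while the
// newline and indentation inside <pre> are left as they were.
echo collapse_outside_blacklist($html), "\n";
```

The lookahead fails for whitespace inside a blacklisted element because the first blacklisted tag it meets when scanning forward is a closing tag such as `</pre>`, not an opening tag or the end of the input.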
I'm new around here, but I can see right off that Alan is quite good at regex. I would only add the following suggestions:
There is an unnecessary (non-capturing) group which can be removed.
Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
On a more serious note, this same alternation group, (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings. This can cause a stack overflow, which makes the Apache/PHP executable silently seg-fault and crash with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible because it has only a 256KB stack, whereas the *nix executables are typically built with a stack of 8MB or more.)
Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in this document (applying a possessive plus to the first alternative), there will still be a lot of recursion if the HTML file is large and has a lot of non-blacklisted tags. For example, on my Win32 box (with an executable having a 256KB stack), the script blows up with a test file of only 60KB.
Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit way too high, at 100000. (According to the PCRE docs, this should be set to a value equal to the stack size divided by 500.)
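The divide-by-500 rule quoted above is simple arithmetic. As a small sketch, the helper below (the name `recursion_limit_for_stack` is hypothetical) computes the limit for the two stack sizes mentioned in this answer:

```php
<?php
// Hypothetical helper: derive a PCRE recursion limit from the stack size
// using the "stack size divided by 500" rule of thumb.
function recursion_limit_for_stack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

echo recursion_limit_for_stack(256 * 1024), "\n";      // 256KB stack: 524
echo recursion_limit_for_stack(8 * 1024 * 1024), "\n"; // 8MB stack: 16777
```

The resulting value would then be passed, as a string, to ini_set("pcre.recursion_limit", ...).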
Here is an improved version which is faster than the original, handles larger input, and gracefully fails with a message if the input string is too large to handle:
// Set PCRE recursion limit to sane value = STACKSIZE / 500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777");  // 8MB stack. *nix
function process_data_jmr1($text)
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}
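If terminating the script is too drastic for a production page, one alternative is to fall back to the unminified text when PCRE reports a failure. The sketch below assumes that serving unminified HTML is acceptable in that case; `minify_or_passthrough` is a hypothetical name, and the pattern is a compact, single-line form of the one above.

```php
<?php
// Hypothetical wrapper: return the minified text, or the original text
// unchanged if preg_replace() fails (for example, on a recursion limit error).
function minify_or_passthrough(string $text): string
{
    $re = '%(?>[^\S ]\s*|\s{2,})'
        . '(?=[^<]*+(?:<(?!/?(?:textarea|pre|script)\b)[^<]*+)*+'
        . '(?:<(?>textarea|pre|script)\b|\z))%i';

    $result = preg_replace($re, ' ', $text);
    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        error_log('HTML minification skipped, PCRE error code ' . preg_last_error());
        return $text;
    }
    return $result;
}

echo minify_or_passthrough("<p>a \n b</p>\n<script>\n  var x = 1;\n</script>"), "\n";
```

Whether to exit, log, or pass the text through unchanged is a design choice; the only part taken from the answer itself is the check for a null return value from preg_replace().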
p.s. I am intimately familiar with this PHP/Apache seg-fault problem, as I was involved with helping the Drupal community while they were wrestling with this issue. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also experienced this with the BBCode parser on the FluxBB forum software project.
Hope this helps.