php - Corrupted string or a preg_match bug?

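NO-BREAK SPACE and many other UTF-8 symbols need 2 bytes for their representation; so, in the context of a string that is supposed to be UTF-8, an isolated non-ASCII byte (> 127, not preceded by xC2) is an unrecognized character... OK, that is only a layout problem (!), but does it corrupt the whole string?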

How can this "unexpected behaviour" be avoided? (It occurs in some functions and not in others.)

Example (only preg_match produces the unexpected behaviour):

  header("Content-Type: text/plain; charset=utf-8"); // same if text/html
  //PHP Version 5.5.4-1+debphp.org~precise+1
  //using a .php file encoded as UTF-8.

  $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
  preg_match_all('/[-\'\p{L}]+/u',$s,$m);
  var_dump($m);            // empty! (corrupted)
  $m=str_word_count($s,1);
  var_dump($m);            // ok

  $s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE";  // utf8-encoded nbsp
  preg_match_all('/[-\'\p{L}]+/u',$s,$m);
  var_dump($m);            // ok!
  $m=str_word_count($s,1);
  var_dump($m);            // ok
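
The empty result in the first case can be inspected more closely. What follows is only a minimal sketch, assuming a PCRE build with UTF-8 support (which the /u modifier needs anyway); the variable names are illustrative. It shows that preg_match_all() does not return zero matches here: it returns false and records an error code that preg_last_error() exposes.

```php
<?php
// A lone 0xA0 byte is not a valid UTF-8 sequence on its own.
$bad = "THE UTF-8 NO-BREAK\xA0SPACE";

// With the /u modifier the subject is validated as UTF-8 first;
// when validation fails, nothing is matched at all.
$result = preg_match_all('/[-\'\p{L}]+/u', $bad, $m);

var_dump($result === false);                          // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

So the string itself is not altered; the whole match is rejected, and the caller only notices if the return value or preg_last_error() is checked.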

Best answer

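This is not a complete answer, because I do not explain why some PHP functions "fail entirely on strings with an invalid encoding" while others do not: see @deceze (in the comments on the question) and @hakre's answer.
If you are looking for a PCRE replacement for str_word_count(), see my preg_word_count() below.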

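PS: regarding the discussion about "uniformity of the behaviour of PHP5's built-in library", my conclusion is that PHP5 is not so bad, but we end up creating many user-defined wrapper (façade) functions (see the diversity of PHP frameworks!)... or we wait for PHP6 :-)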

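Thanks @pebbl! If I understand your link, there is a lack of error messages in PHP. So one possible workaround for the problem I described is to add an error condition... I found the condition here (it ensures valid UTF-8!)... and thanks to @deceze for remembering that a built-in function exists to check this condition (I edited the code afterwards).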
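
As a rough illustration of such a validity check: the sketch below assumes the built-in meant here is mb_check_encoding() from the mbstring extension (the name that also appears in the code comments further down), and falls back to an empty /u pattern when that extension is missing. The helper name is_valid_utf8() is hypothetical, not a PHP built-in.

```php
<?php
// Returns true when $s is well-formed UTF-8.
// is_valid_utf8() is a hypothetical helper name used for illustration.
function is_valid_utf8($s) {
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($s, 'UTF-8');
    }
    // Fallback: an empty pattern with /u matches only if the
    // subject passes PCRE's UTF-8 validation.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("NO-BREAK\xA0SPACE"));      // bool(false)
var_dump(is_valid_utf8("NO-BREAK\xC2\xA0SPACE"));  // bool(true)
```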

Putting the problem together and converting the solution into a function (edited, thanks to @hakre's comments!),

 function my_word_count($s,$triggError=true) {
   if ( preg_match_all('/[-\'\p{L}]+/u',$s,$m) !== false )
      return count($m[0]);
   else {
      if ($triggError) trigger_error(
         // no need for mb_check_encoding($s,'UTF-8'), see hakre's answer;
         // so I was wrong: there is no 'mysterious error' in the preg functions
         (preg_last_error()==PREG_BAD_UTF8_ERROR)?
              'non-UTF8 input!': 'other error',
         E_USER_NOTICE
         );
      return NULL;
   }
 }

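Now (edited after considering @hakre's answer), about uniform behaviour: with the PCRE library we can develop a reasonable function that mimics the str_word_count behaviour and accepts bad UTF-8. For this task I used @bobince's iconv tip: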

 /**
  * Like str_word_count(), but showing how preg can do the same.
  * This function is more flexible, but not faster, than str_word_count().
  * @param $s the input string.
  * @param $wRgx the "word regular expression" as defined by the user.
  * @param $triggError when true, trigger an E_USER_NOTICE on a PCRE error.
  * @param $OnBadUtfTryAgain when true, mimic str_word_count() on bad UTF-8:
  *        retry once after converting the input from CP1252.
  * @return 0 or a positive integer as the word count; the negated PCRE error code on failure.
  */
 function preg_word_count($s,$wRgx='/[-\'\p{L}]+/u', $triggError=true,
                          $OnBadUtfTryAgain=true) {
   if ( preg_match_all($wRgx,$s,$m) !== false )
      return count($m[0]);
   else {
      $lastError = preg_last_error();
      $chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
      if ($OnBadUtfTryAgain && $chkUtf8)
         return preg_word_count(
            iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
         );
      elseif ($triggError) trigger_error(
         $chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
         E_USER_NOTICE
         );
      return -$lastError;
   }
 }
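
The retry step rests on a guess: that the stray bytes come from Windows-1252 (CP1252) text. A small sketch of what that conversion does to the sample input, assuming the iconv extension is available (variable names are illustrative):

```php
<?php
// Assumption: the lone 0xA0 byte was meant as a CP1252 no-break space.
$bad   = "NO-BREAK\xA0SPACE";
$fixed = iconv('CP1252', 'UTF-8', $bad);

// 0xA0 in CP1252 is U+00A0, which UTF-8 encodes as C2 A0.
var_dump($fixed === "NO-BREAK\xC2\xA0SPACE");  // bool(true)

// The converted string now passes PCRE's UTF-8 validation.
var_dump(preg_match('//u', $fixed) === 1);     // bool(true)
```

Two caveats: iconv() returns false on failure, which may happen for bytes the source charset does not define, and converting the whole string as CP1252 also re-encodes any multi-byte sequences in it that were already valid UTF-8.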

Demo (try other inputs!):

 $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
 print "\n-- str_word_count=".str_word_count($s,0);
 print "\n-- preg_word_count=".preg_word_count($s);

 $s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE";  // utf8-encoded nbsp
 print "\n-- str_word_count=".str_word_count($s,0);
 print "\n-- preg_word_count=".preg_word_count($s);