How to handle user input with invalid UTF-8 characters?

Question

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.

Even though my webapp uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode(), and overall it seems like a bad idea to have such data around.

W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".

How exactly should this be done in practice, throughout a site with dozens of different places where data can be input?
How do you present the error in a helpful way to the user?
How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database but do something (what?) before json_encode()?

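EDIT: I'm very familiar with the mbstring extension and am not asking how UTF-8 works in PHP. I'd like advice from people with real-world experience on how they've handled this.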

EDIT2: As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

Recommended answer

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Poorly written form submission bots are a good example.

What I usually do is ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions (these two only convert between ISO-8859-1 and UTF-8, and are deprecated as of PHP 8.2). If you use iconv(), you also have the option to transliterate bad characters.

Here is an example using iconv():

$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
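
For the U+FFFD replacement requested in EDIT2, one possible route is mbstring's substitute character rather than iconv. The following is only a minimal sketch, assuming the mbstring extension is loaded; the function name utf8_replace_invalid() is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper (not from the original answer): replaces every invalid
// UTF-8 byte sequence in $str with U+FFFD. Assumes ext-mbstring is loaded.
function utf8_replace_invalid(string $str): string
{
    $previous = mb_substitute_character(); // remember the current setting
    mb_substitute_character(0xFFFD);       // substitute invalid input with U+FFFD

    // Converting UTF-8 to UTF-8 makes mbstring re-validate the input.
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);    // restore the previous setting
    return $clean;
}

// Usage: the stray 0xFF byte is invalid in UTF-8 and gets replaced.
$fixed = utf8_replace_invalid("abc\xFFdef");
```

Unlike //IGNORE, this keeps a visible marker where the bad bytes were, so the user can see which part of the text was affected.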

If you want to display an error message to your users, I'd probably do it globally rather than per received value. Something like this would probably do just fine:

function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!
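
Note that array_map() only visits the top level, so a request such as ?tags[]=... passes an array to utf8_clean(). A rough sketch of a recursive variant follows. It swaps in mbstring (mb_check_encoding() plus a UTF-8 to UTF-8 conversion) instead of iconv, and the name utf8_clean_recursive() is hypothetical; array keys are left untouched:

```php
<?php
// Hypothetical helper (not from the original answer). Assumes ext-mbstring.
// Returns [cleaned array, true if any string value contained invalid UTF-8].
function utf8_clean_recursive(array $input): array
{
    $dirty = false;
    $previous = mb_substitute_character();
    mb_substitute_character('none'); // drop invalid bytes instead of replacing them

    // array_walk_recursive() visits every non-array leaf, at any depth.
    array_walk_recursive($input, function (&$value) use (&$dirty) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $dirty = true;
            $value = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
        }
    });

    mb_substitute_character($previous); // restore the previous setting
    return [$input, $dirty];
}

// Usage with a nested array shaped like $_GET:
[$clean, $dirty] = utf8_clean_recursive(['q' => 'ok', 'tags' => ["ba\xFFd"]]);
$error_msg = $dirty ? 'Your data is not valid UTF-8 and has been stripped.' : null;
```

Returning a flag avoids serializing both arrays just to compare them.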

You may also want to normalize new lines and strip (non-)visible control characters, like this:

function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        return preg_replace('~\p{C}+~u', '', $string);
    }

    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}

Converting from UTF-8 to Unicode code points:

function Codepoint($char)
{
    $result = null;
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
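
On PHP 7.2 or later, mb_ord() can serve the same purpose without the iconv()/unpack() round trip. A small sketch, assuming ext-mbstring; codepoint_label() is a hypothetical name, and the explicit validity check is there because mb_ord() is not a validator:

```php
<?php
// Hypothetical alternative to Codepoint(); assumes PHP 7.2+ with ext-mbstring.
// Returns e.g. "U+00E0" for the first character, or null for empty/invalid input.
function codepoint_label(string $char): ?string
{
    // Reject empty and malformed input up front.
    if ($char === '' || !mb_check_encoding($char, 'UTF-8')) {
        return null;
    }

    $codepoint = mb_ord($char, 'UTF-8'); // code point of the first character
    return is_int($codepoint) ? sprintf('U+%04X', $codepoint) : null;
}

// Usage: "\u{E0}" is à and "\u{3072}" is ひ, written as escapes (PHP 7+).
$labels = [codepoint_label("\u{E0}"), codepoint_label("\u{3072}")];
```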

Probably faster than any other alternative, though I haven't tested it extensively.

Example:

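// U+FFFE and U+FFFD written as escapes (PHP 7+) so both characters survive copy and paste
$string = "\u{FFFE}hello world\u{FFFD}";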

// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}
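
For existing database rows that break json_encode(), as raised in the question, the data can also be left as-is and substituted at encode time. A minimal sketch, assuming PHP 7.2+ (where the JSON_INVALID_UTF8_SUBSTITUTE flag is available); json_encode_lenient() is a hypothetical wrapper name:

```php
<?php
// Hypothetical wrapper (not from the original answer). Assumes PHP 7.2+.
// Invalid UTF-8 bytes are emitted as \ufffd instead of making json_encode() fail.
function json_encode_lenient($value): string
{
    $json = json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE);

    if ($json === false) {
        // Some other encoding problem (e.g. unsupported type or depth).
        throw new RuntimeException(json_last_error_msg());
    }

    return $json;
}

// Usage: plain json_encode() returns false for this row, the wrapper does not.
$row = ['name' => "a\xFFb"];
$json = json_encode_lenient($row);
```

JSON_INVALID_UTF8_IGNORE is the counterpart that drops the bad bytes instead of substituting them.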

Is this what you were looking for?
