I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my webapp uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.
W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".
- How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
- How do you present the error in a helpful way to the user?
- How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
- For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave as-is in the database but doing something (what?) before json_encode()?
EDIT2: As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.
The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Crappy form submission bots are a good example...
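Since the attribute can't be relied on, the check has to happen server-side. A minimal sketch of detecting which submitted fields are not valid UTF-8, assuming the mbstring extension is available and PHP 7.2+ (where mb_check_encoding() also accepts arrays); invalid_utf8_fields() is a hypothetical helper name, not something from the question:

```php
<?php
// Hypothetical helper: returns the keys of $input whose values are not valid UTF-8.
function invalid_utf8_fields(array $input): array
{
    $bad = array();
    foreach ($input as $key => $value)
    {
        // mb_check_encoding() accepts strings and, since PHP 7.2, nested arrays.
        if (!mb_check_encoding($value, 'UTF-8'))
        {
            $bad[] = $key;
        }
    }
    return $bad;
}

// "Jos\xC3\xA9" is valid UTF-8; "caf\xE9" ends in a lone Latin-1 byte and is not.
$fields = array('name' => "Jos\xC3\xA9", 'comment' => "caf\xE9");
print_r(invalid_utf8_fields($fields));
```

Knowing which fields failed lets you point the error message at a specific input instead of rejecting the whole form.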
What I usually do is ignore bad chars, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv() you also have the option to transliterate bad chars.
Here is an example using iconv():
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);     // drop invalid sequences
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str); // transliterate where possible
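For the U+FFFD request in the question, replacing rather than dropping bad sequences can be sketched with mbstring instead of iconv(). This is a swapped-in technique, not part of the iconv() approach above, and utf8_replace_invalid() is a hypothetical name. It assumes the mbstring extension and PHP 7+; I haven't benchmarked it, so treat "fast" as unverified:

```php
<?php
// Hypothetical helper: replace every invalid UTF-8 sequence with U+FFFD.
function utf8_replace_invalid(string $str): string
{
    $previous = mb_substitute_character();   // remember the global setting
    mb_substitute_character(0xFFFD);         // substitute with U+FFFD instead of "?"
    // Converting UTF-8 to UTF-8 forces mbstring to validate the input.
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the previous setting
    return $clean;
}

// "\xFF" can never appear in valid UTF-8.
echo bin2hex(utf8_replace_invalid("a\xFFb")), "\n";
```

The substitute character is a global mbstring setting, which is why the sketch saves and restores it around the conversion.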
If you want to display an error message to your users, I'd probably do it globally rather than once per value received. Something like this would probably do just fine:
function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

// If cleaning changed anything, the request contained invalid UTF-8.
if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!
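One caveat with array_map() here: request arrays can be nested (e.g. a field named tags[]), and a plain string cleaner will not handle an array value. A sketch of a recursive variant, assuming the mbstring extension; utf8_clean_recursive() is a hypothetical name, and it strips invalid bytes via mb_convert_encoding() with the substitute character set to 'none' rather than via iconv():

```php
<?php
// Hypothetical recursive cleaner for nested request arrays.
function utf8_clean_recursive(array $input): array
{
    $previous = mb_substitute_character();
    mb_substitute_character('none');          // drop invalid sequences entirely
    array_walk_recursive($input, function (&$value)
    {
        if (is_string($value))
        {
            $value = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
        }
    });
    mb_substitute_character($previous);       // restore the global setting
    return $input;
}

// Stand-in for $_GET with one nested field and two invalid values.
$request = array('q' => "a\xFFb", 'tags' => array('ok', "x\xFF"));
$clean = utf8_clean_recursive($request);

if ($clean !== $request)
{
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}
```

Comparing the arrays directly with !== avoids the serialize() round trip while still detecting any change.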
You may also want to normalize newlines and strip (non-)visible control chars, like this:
function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
    if ($control === true)
    {
        return preg_replace('~\p{C}+~u', '', $string); // strip every control char
    }
    // Normalize \r\n and \r to \n, then strip control chars except tab and newline.
    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}
function Codepoint($char)
{
    $result = null;
    // Convert to big-endian UCS-4 and read it as one unsigned 32-bit integer.
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }
    return $result;
}
echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
This is probably faster than the alternatives, though I haven't tested it extensively.
$string = "\u{FFFE}hello world\u{FFFD}"; // U+FFFE and U+FFFD written as escapes (PHP 7+)
// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
function Bad_Codepoint($string)
{
    $result = array();
    foreach ((array) $string as $char) // $string is the array of regex matches
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }
    return implode('', $result);
}
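As for the json_encode() errors mentioned in the question: if you'd rather leave the stored data as-is, PHP 7.2+ lets json_encode() deal with invalid UTF-8 itself through the JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE flags. A small sketch, assuming PHP 7.2 or later (the sample value is made up):

```php
<?php
$value = array('comment' => "a\xFFb"); // "\xFF" can never appear in valid UTF-8

var_dump(json_encode($value));                                // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8);              // bool(true)

echo json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE), "\n"; // {"comment":"a\ufffdb"}
echo json_encode($value, JSON_INVALID_UTF8_IGNORE), "\n";     // {"comment":"ab"}
```

The substitute flag emits U+FFFD for each invalid sequence, which keeps the damage visible; the ignore flag drops the bytes silently.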