Problem description
I'm setting up a new server and want to support UTF-8 fully in my web application. I have tried this in the past on existing servers and always seem to end up having to fall back to ISO-8859-1.
Where exactly do I need to set the encoding/charsets? I'm aware that I need to configure Apache, MySQL, and PHP to do this. Is there some standard checklist I can follow, or a way to troubleshoot where the mismatches occur?
This is for a new Linux server, running MySQL 5, PHP 5, and Apache 2.
Recommended answer
Data storage:
Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that MySQL will implicitly use utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set).
In older versions of MySQL (< 5.5.3), you'll unfortunately be forced to use simply utf8, which only supports a subset of Unicode characters (those that UTF-8 encodes in at most three bytes). I wish I were kidding.
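To make the utf8 versus utf8mb4 distinction concrete, here is a minimal sketch that checks whether a string contains characters needing four bytes in UTF-8. The helper name needs_utf8mb4() is hypothetical and not part of the original answer, and the "\u{...}" escapes and type declarations assume PHP 7 or later rather than the PHP 5 mentioned in the question.

```php
<?php
// Hypothetical helper (an illustration, not part of the original answer):
// returns true when $text contains a code point above U+FFFF, which UTF-8
// encodes in four bytes and which MySQL's legacy utf8 charset cannot store.
function needs_utf8mb4(string $text): bool
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$euro  = "\u{20AC}";   // EURO SIGN, inside the Basic Multilingual Plane
$emoji = "\u{1F600}";  // GRINNING FACE, outside the Basic Multilingual Plane

echo strlen($euro), "\n";   // byte length of the euro sign: 3
echo strlen($emoji), "\n";  // byte length of the emoji: 4
var_dump(needs_utf8mb4($euro));   // bool(false)
var_dump(needs_utf8mb4($emoji));  // bool(true)
```

A check like this can help when auditing existing data before deciding whether a column still on utf8 is losing characters.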
Data access:
In your application code (e.g. PHP), in whatever DB access method you use, you'll need to set the connection charset to utf8mb4. This way, MySQL does no conversion from its native UTF-8 when it hands data off to your application, and vice versa.
Some drivers provide their own mechanism for configuring the connection character set, which both updates the driver's internal state and informs MySQL of the encoding to be used on the connection; this is usually the preferred approach. In PHP:
If you're using the PDO abstraction layer with PHP ≥ 5.3.6, you can specify charset in the DSN:
$dbh = new PDO('mysql:charset=utf8mb4');
If you're using mysqli, you can call set_charset():
$mysqli->set_charset('utf8mb4'); // object oriented style
mysqli_set_charset($link, 'utf8mb4'); // procedural style
If you're stuck with plain mysql but happen to be running PHP ≥ 5.2.3, you can call mysql_set_charset.
If the driver does not provide its own mechanism for setting the connection character set, you may have to issue a query to tell MySQL how your application expects data on the connection to be encoded: SET NAMES 'utf8mb4'.
The same consideration regarding utf8mb4/utf8 applies as above.
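As a rough sketch of the PDO variant described above, the snippet below assembles a fuller DSN that carries charset=utf8mb4. The helper build_mysql_dsn(), the host localhost, the database app_db, and the credentials are all placeholder assumptions, and the type declarations assume PHP 7 or later. The connection itself is left commented out because it needs a reachable MySQL server and the pdo_mysql extension.

```php
<?php
// Hypothetical helper (an illustration, not part of the original answer):
// assembles a PDO MySQL DSN that always carries charset=utf8mb4.
function build_mysql_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// 'localhost' and 'app_db' are placeholder values.
$dsn = build_mysql_dsn('localhost', 'app_db');
echo $dsn, "\n";

// Connecting needs a running MySQL server, so these calls are commented out.
// 'app_user' and 'secret' are placeholder credentials.
// $dbh = new PDO($dsn, 'app_user', 'secret', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
// Ask the server which charset it is using for this connection:
// echo $dbh->query('SELECT @@character_set_connection')->fetchColumn(), "\n";
```

Querying @@character_set_connection after connecting is one way to check that the setting actually took effect.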
Output:
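If your application transmits text to other systems, they also need to know the character encoding. For web applications, the browser must be told the encoding in which data is sent (through HTTP response headers or HTML metadata).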
In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type MIME header yourself, which is just more work but has the same effect.
When encoding the output using json_encode(), add JSON_UNESCAPED_UNICODE as a second parameter.
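The two output points above can be sketched together as follows. The payload is an arbitrary example value, the text/plain content type is chosen only so the demo prints readably, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Arbitrary example payload: three CJK characters written as code point
// escapes so the snippet does not depend on this file's own encoding.
$payload = ['name' => "\u{65E5}\u{672C}\u{8A9E}"];

// Declare the response encoding explicitly. header() must run before any
// output, hence the headers_sent() guard.
if (!headers_sent()) {
    header('Content-Type: text/plain; charset=utf-8');
}

// Default behaviour: non-ASCII characters become \uXXXX escape sequences.
$escaped = json_encode($payload);

// With the flag: characters are emitted as literal UTF-8 bytes.
$unescaped = json_encode($payload, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";
echo $unescaped, "\n";
```

Both forms decode to the same value; the unescaped one is shorter and easier to read in logs, provided the consumer is told the response is UTF-8.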
Input:
Unfortunately, you should verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.
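A minimal sketch of that validation step, assuming the mbstring extension is loaded and PHP 7.1 or later (for the nullable return type). The helper name accept_utf8() is hypothetical; in a real application the input would come from the request rather than from literals.

```php
<?php
// Hypothetical helper (an illustration, not part of the original answer):
// returns the string unchanged when it is valid UTF-8, or null otherwise,
// so callers are forced to handle the rejection case.
function accept_utf8(string $raw): ?string
{
    return mb_check_encoding($raw, 'UTF-8') ? $raw : null;
}

$valid   = "h\u{E9}llo";  // "héllo" encoded as UTF-8 (é is the bytes C3 A9)
$invalid = "h\xE9llo";    // "héllo" as ISO-8859-1: a lone E9 byte is not valid UTF-8

var_dump(accept_utf8($valid) !== null);    // bool(true)
var_dump(accept_utf8($invalid) !== null);  // bool(false)
```

Rejecting invalid input outright, rather than trying to repair it, keeps the rule simple enough to apply to every incoming string.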
From my reading of the current HTML spec, the following sub-bullets are not necessary or even valid anymore for modern HTML. My understanding is that browsers will work with and submit data in the character set specified for the document. However, if you're targeting older versions of HTML (XHTML, HTML4, etc.), these points may still be useful:
For HTML before HTML5 only: you want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is to add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.
For HTML before HTML5 only: note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.
Other code considerations:
Obviously enough, all files you'll be serving (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.
You need to make sure that every time you process a UTF-8 string, you do so safely. This is, unfortunately, the hard part. You'll probably want to make extensive use of PHP's mbstring extension.
PHP's built-in string operations are not UTF-8 safe by default. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
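To illustrate the difference, the sketch below compares the byte-oriented built-ins with their mbstring counterparts on a short string. It assumes the mbstring extension is loaded and PHP 7 or later for the "\u{...}" escape; the variable names are arbitrary.

```php
<?php
$word = "h\u{E9}llo";  // "héllo": five characters, six bytes in UTF-8

// strlen() counts bytes, mb_strlen() counts characters.
echo strlen($word), "\n";              // 6
echo mb_strlen($word, 'UTF-8'), "\n";  // 5

// substr() cuts at byte offsets and can split a multibyte character in half.
$byteCut = substr($word, 0, 2);           // "h" plus only the first byte of é
$charCut = mb_substr($word, 0, 2, 'UTF-8');  // "hé"

var_dump(mb_check_encoding($byteCut, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8'));  // bool(true)

// Concatenating two valid UTF-8 strings always yields valid UTF-8.
var_dump(mb_check_encoding($word . $word, 'UTF-8'));  // bool(true)
```

The broken substr() result is exactly the kind of string that later fails validation or shows up as a replacement character in the browser.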
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.