PHP: handling accented Unicode characters and diacritics

Question

On our website, some Mac users have trouble when they copy and paste text from PDF files into a textarea (handled by TinyMCE). All accented characters are corrupted and become, for example, e? for é, i? for î, and so on. I cannot reproduce this problem on a Windows computer.

When I wrote the content of the textarea to a file (before inserting it into the database), I discovered that the pasted é looks visually different from a regular é when viewed in Vim.

Indeed:

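// the corrupted é, as pasted from the PDF: "e" followed by U+0301
// (the "\u{...}" escape requires PHP 7 or later)
$char = "e\u{0301}";
echo bin2hex($char); // prints 65cc81

// regular precomposed é (U+00E9)
echo bin2hex('é');   // prints c3a9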

After a lot of searching, here is where I stand: it seems that Mac OS copies accented Unicode characters as a combination of two characters. In our example, that is e followed by the combining acute accent (U+0301). So far, I have not found a way to replace the corrupted é with the real one and avoid e? in the database.

I'm a bit desperate.

Recommended answer

The process of converting the representation to one form or the other is called normalization. PHP has the Normalizer class for this, and passing all input through it is a good idea:

$input = Normalizer::normalize($input);

You likely want to normalize to Form C (NFC): canonical decomposition followed by canonical composition.
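As a minimal sketch of what Form C does to the character from the question, the snippet below builds the decomposed sequence by hand and normalizes it. It assumes PHP 7 or later (for the "\u{...}" escape) and the intl extension, which provides Normalizer; the variable names are only illustrative.

```php
<?php
// Decomposed input: "e" followed by U+0301 COMBINING ACUTE ACCENT,
// matching the bytes 65cc81 reported in the question.
$decomposed = "e\u{0301}";

// Normalize to Form C (NFC). FORM_C is also the default form.
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

if ($composed === false) {
    // normalize() returns false on failure, for example on invalid UTF-8.
    $composed = $decomposed;
}

echo bin2hex($decomposed), "\n"; // prints 65cc81
echo bin2hex($composed), "\n";   // prints c3a9
```

After normalization the two bytes of the precomposed é are stored, so the value matches an é typed directly on the keyboard.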

If that class is not available on your system, there is the Patchwork UTF-8 library.
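Whichever implementation ends up providing Normalizer, it can help to route input through one small wrapper. The following is a sketch under stated assumptions: normalize_input() and the 'content' form field are hypothetical names, and the fallback simply returns the input unchanged rather than calling any particular library.

```php
<?php
/**
 * Hypothetical helper: normalize a string to NFC when a Normalizer
 * class is available; otherwise return the string unchanged.
 */
function normalize_input(string $input): string
{
    if (!class_exists('Normalizer')) {
        // Neither the intl extension nor a replacement class is loaded.
        return $input;
    }

    $normalized = Normalizer::normalize($input, Normalizer::FORM_C);

    // normalize() returns false on failure, for example on invalid UTF-8.
    return $normalized === false ? $input : $normalized;
}

// 'content' is an assumed field name for the TinyMCE textarea.
$text = normalize_input($_POST['content'] ?? '');
```

Normalizing once at the input boundary, before the value reaches the database, keeps stored text in a single form, so later comparisons and searches do not depend on which platform the text was pasted from.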

