
Problem description

I've been reading up on a few solutions but have not managed to get anything to work as yet.

I have a JSON string that I read in from an API call, and it contains escaped Unicode characters. For example, \u00c2\u00a3 is the £ symbol.

I'd like to use PHP to convert these into either £ or &pound;.

I'm looking into the problem and found the following code (using my pound symbol to test) but it didn't seem to work:

$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');

The output is Â£.

Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?

UPDATE

It seems that the JSON string from the API uses 2 or 3 escaped Unicode sequences per character, e.g.:

That\u00e2\u0080\u0099s (right single quotation)
\u00c2\u00a (pound symbol)
Solution

It is not UTF-16 encoding. It looks like bogus encoding, because the \uXXXX escape is independent of any UTF or UCS encoding of Unicode. \u00c2\u00a3 really maps to the string Â£.

What you should have is \u00a3, which is the Unicode code point for £.

{0xC2, 0xA3} is the 2-byte UTF-8 encoding of this code point.
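
To see the difference concretely, here is a minimal sketch (not part of the original answer) that decodes both escape sequences with PHP's built-in json_decode and prints the resulting bytes with bin2hex. The variable names are only illustrative.

```php
<?php
// What the API sent: two escapes, one per UTF-8 byte of "£".
$bogus = json_decode('"\u00c2\u00a3"');
// What it should have sent: a single escape for code point U+00A3.
$correct = json_decode('"\u00a3"');

// json_decode returns UTF-8, so U+00C2 and U+00A3 each become two bytes.
echo bin2hex($bogus), "\n";   // c382c2a3, i.e. "Â£"
echo bin2hex($correct), "\n"; // c2a3, i.e. "£"
```

The first result is four bytes long because each of the two bogus code points is itself encoded as a 2-byte UTF-8 character.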

If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of Unicode code points to a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.

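function fixBadUnicode($str) {
    // The /e modifier was removed in PHP 7.0, so use a callback to rebuild the two bytes.
    $bytes = preg_replace_callback('/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        fn($m) => chr(hexdec($m[1])) . chr(hexdec($m[2])), $str);
    return utf8_decode($bytes); // UTF-8 to ISO-8859-1; deprecated since PHP 8.2
}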

Example here: http://phpfiddle.org/main/code/6sq-rkn
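
An alternative sketch, assuming the whole string was double-encoded this way and that the mbstring extension is available: let json_decode handle the escapes first, then convert the result from UTF-8 to ISO-8859-1. Every code point below U+0100 maps to the single byte of the same value in ISO-8859-1, so this recovers the original UTF-8 bytes. The function name undoDoubleEncoding and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: assumes $decoded contains only code points below U+0100,
// each standing in for one byte of the original UTF-8 text.
function undoDoubleEncoding(string $decoded): string {
    // U+00C2 U+00A3 becomes the bytes C2 A3, which is "£" in UTF-8.
    return mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');
}

$title = json_decode('"That\u00e2\u0080\u0099s \u00c2\u00a35"');
$fixed = undoDoubleEncoding($title);
echo $fixed, "\n"; // That’s £5
```

This only holds if nothing in the string was encoded correctly: a genuine character above U+00FF has no ISO-8859-1 byte and would be replaced by '?'.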

Edit:

If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:

function fixBadUnicodeForJson($str) {
    // Turn every captured hex pair into its raw byte (replaces the /e modifier, removed in PHP 7.0).
    $toBytes = fn($m) => implode('', array_map(fn($h) => chr(hexdec($h)), array_slice($m, 1)));
    $escape = '\\\\u00([0-9a-f]{2})';
    // Match the longest sequences first: 4-byte, 3-byte, 2-byte, then single bytes.
    for ($n = 4; $n >= 1; $n--) {
        $str = preg_replace_callback('/' . str_repeat($escape, $n) . '/', $toBytes, $str);
    }
    return $str;
}

Edit 2: fixed the previous function so that it transforms any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.

Be careful: some of these characters, which probably come from an editor such as Word, cannot be translated to ISO-8859-1 and will therefore appear as '?' after utf8_decode.
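
If the goal is HTML output, as the question asks, one option is to keep the repaired string in UTF-8 and let htmlentities produce named entities instead of converting to ISO-8859-1. This is a sketch under the assumption that the string has already been repaired to valid UTF-8 and that mbstring is available for the comparison step; the sample text is made up.

```php
<?php
// Already-repaired UTF-8 text (sample value for illustration).
$fixed = "That\u{2019}s \u{00A3}5";

// Converting to ISO-8859-1 loses the curly apostrophe, which has no Latin-1 equivalent.
$latin1 = mb_convert_encoding($fixed, 'ISO-8859-1', 'UTF-8');
var_dump(strpos($latin1, '?') !== false); // bool(true)

// Staying in UTF-8 and escaping for HTML keeps both characters.
$html = htmlentities($fixed, ENT_QUOTES, 'UTF-8');
echo $html, "\n"; // That&rsquo;s &pound;5
```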

