Problem description
I have some json I need to decode, alter and then encode without messing up any characters.
If I have a Unicode character in a JSON string, it will not decode. I'm not sure why, since json.org says a string can contain any-Unicode-character-except-"-or-\-or-control-character. It doesn't work in Python either.
{"Tag":"Odómetro"}
I can use utf8_encode which will allow the string to be decoded with json_decode, however the character gets mangled into something else. This is the result from a print_r of the result array. Two characters.
[Tag] => OdÃ³metro
When I encode the array again, I get the character escaped to ASCII, which is correct according to the JSON spec:
"Tag"=>"Od\u00f3metro"
Is there some way I can un-escape this? json_encode gives no such option, utf8_encode does not seem to work either.
Edit: I see there is a JSON_UNESCAPED_UNICODE option for json_encode. However, it's not working as expected. Oh damn, it's only available from PHP 5.4. I will have to use some regex, as I only have 5.3.
$json = json_encode($array, JSON_UNESCAPED_UNICODE);
Warning: json_encode() expects parameter 2 to be long, string ...
Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.
Here's why I think so:
- json_decode was able to parse the input after you ran the string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
- You did say that you got "mangled" output when using print_r after doing utf8_encode, but the mangled output you got is exactly what you would see when UTF-8 text is interpreted as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, and that byte sequence reads as Ã³ in ISO 8859-1).
- Your htmlentities hackaround solution worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1. (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
- You said you had the same problem in Python, which would seem to exclude PHP from being the issue.
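The byte-level reasoning in the points above can be checked with a short sketch. It is written in Python rather than PHP, since the same behaviour was reported there. It assumes the incoming JSON really is ISO 8859-1, which is a guess based on the symptoms. The helper utf8_encode_like is a hypothetical stand-in that only imitates what PHP's utf8_encode does: it reads the bytes as ISO 8859-1 and re-encodes them as UTF-8.

```python
import json

# Assumption: the JSON arrives as ISO 8859-1 bytes, matching the symptoms described.
raw_latin1 = '{"Tag":"Odómetro"}'.encode("iso-8859-1")


def utf8_encode_like(data: bytes) -> bytes:
    """Hypothetical stand-in for PHP's utf8_encode: ISO 8859-1 bytes -> UTF-8 bytes."""
    return data.decode("iso-8859-1").encode("utf-8")


# 1. ISO 8859-1 bytes are not valid UTF-8 here, so the JSON parser rejects them.
try:
    json.loads(raw_latin1)
    parsed_without_conversion = True
except ValueError:  # UnicodeDecodeError is a subclass of ValueError
    parsed_without_conversion = False

# 2. After conversion to UTF-8 the same document parses.
tag = json.loads(utf8_encode_like(raw_latin1))["Tag"]

# 3. Displaying the UTF-8 bytes as if they were ISO 8859-1 yields two characters for "ó".
mojibake = tag.encode("utf-8").decode("iso-8859-1")

print(parsed_without_conversion)  # False
print(tag)                        # Odómetro
print(mojibake)                   # OdÃ³metro
```

If the third line of output matches what print_r showed, the decoded data is probably fine and only the display charset is wrong.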
PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.
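A small sketch shows that the escaped and unescaped forms are the same JSON value. It uses Python's json module as a stand-in for PHP's json_encode and json_decode. Here ensure_ascii=False plays roughly the role that JSON_UNESCAPED_UNICODE plays in PHP 5.4 and later; the two are analogous, not identical APIs.

```python
import json

data = {"Tag": "Odómetro"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# Keep the literal character instead of escaping it.
unescaped = json.dumps(data, ensure_ascii=False)

print(escaped)    # {"Tag": "Od\u00f3metro"}
print(unescaped)  # {"Tag": "Odómetro"}

# Any conforming parser decodes both forms to the same value.
same = json.loads(escaped) == json.loads(unescaped) == data
print(same)       # True
```

So un-escaping is only needed if something downstream cannot parse JSON escapes. A JSON consumer should treat both strings identically.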
So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1).
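To illustrate the limitation in the parenthetical, here is a minimal Python sketch of which characters fit in ISO 8859-1 at all. The helper fits_iso_8859_1 is hypothetical and exists only for this example. Anything it rejects could not have been stored in, or recovered from, an ISO 8859-1 byte string, so a utf8_encode-style conversion cannot produce it.

```python
def fits_iso_8859_1(text: str) -> bool:
    """Hypothetical helper: True if every character exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_iso_8859_1("Odómetro"))  # True: ó is part of ISO 8859-1
print(fits_iso_8859_1("€"))         # False: the euro sign is not
print(fits_iso_8859_1("日本語"))     # False: CJK characters are not
```

This is why having the database connection deliver UTF-8 directly is the more robust fix.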
Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your print_r testing).