I have a JSON file that contains text like this:
.....wax, and voila!\u00c2\u00a0At the moment you can't use our ...
My simple question is: how can I convert (not remove) these \u codes into spaces, apostrophes, etc.?
Input: a text file containing
.....wax, and voila!\u00c2\u00a0At the moment you can't use our ...
Output:
.....wax, and voila!(converted to the line break)At the moment you can't use our ...
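Python code:
import requests

def TEST():
    export = requests.get('https://sample.uk/', auth=('user', 'pass')).text
    with open("TEST.json", 'w') as file:
        # Note: on Python 3, .text is already a str, so this .decode() call raises AttributeError
        file.write(export.decode('utf8'))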
What I have tried:
Using .json()
Various combinations of .encode().decode(), etc.
Edit 1
When I upload this file to BigQuery, I get the Â symbol.
Bigger sample:
{
"xxxx1": "...You don\u2019t nee...",
"xxxx2": "...Gu\u00e9rer...",
"xxxx3": "...boost.\u00a0Sit back an....",
"xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
"xxxx5": "\u00a0\n\u00a0",
"xxxx6": "It was Christmas Eve babe\u2026",
"xxxx7": "It\u2019s xxx xxx\u2026"
}
Python code:
import json
import re
import codecs

def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))
    with open("TEST.json", "w") as file:
        json.dump(x, file)

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'
    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)

if __name__ == '__main__':
    load()
Best answer
I have put together this crude UTF-8 unmangler, which appears to solve your messed-up encoding situation:
import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'
    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
        return escaped                         # keep the original escapes so the text is not lost

Usage:
broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)
converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)
data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])
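It uses a regular expression to pick out the hex sequences from your string, converts them to individual bytes, and decodes those bytes as UTF-8.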
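The same bytes-then-UTF-8 step can also be applied after parsing instead of with a regex on the raw text. The following is only a sketch under assumptions: the helper names `repair_mojibake` and `repair_all` and the `sample` string are made up for illustration, and it assumes the damage came from UTF-8 bytes being read as Latin-1 (which matches escapes such as \u00c2\u00a0). Each string is repaired as a whole or left alone, so a value that mixes mangled and correct non-ASCII characters stays unchanged.

```python
import json


def repair_mojibake(text):
    """Undo one round of 'UTF-8 bytes read as Latin-1'; return text unchanged if that does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters outside Latin-1 (e.g. U+2019),
        # or its bytes are not valid UTF-8 (e.g. a lone e-acute): leave it as it is.
        return text


def repair_all(value):
    """Walk parsed JSON data and repair every string found in it."""
    if isinstance(value, str):
        return repair_mojibake(value)
    if isinstance(value, list):
        return [repair_all(item) for item in value]
    if isinstance(value, dict):
        return {repair_all(key): repair_all(item) for key, item in value.items()}
    return value


# Hypothetical sample mixing mangled ("a", "d") and already correct ("b", "c") values.
sample = r'{"a": "voila!\u00c2\u00a0At", "b": "Gu\u00e9rer", "c": "don\u2019t", "d": "\u00e2\u0082\u00ac"}'
fixed = repair_all(json.loads(sample))
print(fixed)
```

Because the repair runs on parsed strings, correctly encoded values such as "Gu\u00e9rer" pass through untouched rather than being reported as undecodable.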
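For the broken_json string above (I have included the 3-byte character € as a test), this outputs:
Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...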
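The \xa0 in "Parsed data" is caused by the way Python prints dicts to the console; it is still an actual non-breaking space.
Regarding "python - File contains \u00c2\u00a0, convert to characters", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56955320/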
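As a small check of the point about \xa0, the sketch below compares how one string looks inside a printed dict and on its own, and how `json.dumps` writes it with and without `ensure_ascii=False`. The variable names and the shortened `value` string are illustrative assumptions, not part of the original answer.

```python
import json

# A string holding two real non-breaking spaces (U+00A0).
value = "voila!\u00a0\u00a0At the moment"

# Printing a dict shows the repr of each string, which escapes U+00A0 as \xa0.
shown_in_dict = repr({"some_key": value})
# Printing the string itself sends the real characters to the console.
shown_alone = value

# json.dumps escapes every non-ASCII character by default ...
escaped = json.dumps({"some_key": value})
# ... and keeps the characters themselves when ensure_ascii is False.
readable = json.dumps({"some_key": value}, ensure_ascii=False)

print(shown_in_dict)
print(escaped)
print(readable)
```

If the goal is a file with real characters rather than \u escapes, one option is to open it with `encoding="utf-8"` and pass `ensure_ascii=False` to `json.dump`; both forms parse back to the same data.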