I have a JSON file which contains text like this:

 .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...


My simple question is: how do I convert (not remove) these \u codes into spaces, apostrophes, etc.?

Input: a text file containing .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

Output: .....wax, and voila!(converted to a non-breaking space)At the moment you can't use our ...
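To make the problem concrete (a minimal sketch with a made-up key, not taken from the real file): json.loads already decodes the \uXXXX escapes, but for this data that produces the mojibake pair Â plus a non-breaking space, not the single non-breaking space that was meant:

import json

# json.loads decodes the escapes, but the result is still mojibake:
broken = r'{"key": "voila!\u00c2\u00a0At the moment"}'
value = json.loads(broken)["key"]

print(value)          # voila!Â At the moment
print(repr(value))    # 'voila!Â\xa0At the moment'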

Python code

import requests


def TEST():
    # .text is already a decoded str in Python 3, so write it out directly
    export = requests.get('https://sample.uk/', auth=('user', 'pass')).text

    with open("TEST.json", 'w') as file:
        file.write(export)


What I have tried


Using .json()
Any number of .encode() / .decode() combinations, etc.


Edit 1

When I upload this file to BigQuery, I get Â symbols.

A bigger sample:

{
    "xxxx1": "...You don\u2019t nee...",
    "xxxx2": "...Gu\u00e9rer...",
    "xxxx3": "...boost.\u00a0Sit back an....",
    "xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
    "xxxx5": "\u00a0\n\u00a0",
    "xxxx6": "It was Christmas Eve babe\u2026",
    "xxxx7": "It\u2019s xxx xxx\u2026"
}


Python code:

import json
import re
import codecs


def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))

    with open("TEST.json", "w") as file:
        json.dump(x,file)

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        # Not mangled UTF-8 (e.g. a legitimate \u00e9 or \u00a0 escape):
        # keep the original escape so json.loads can still decode it later,
        # instead of returning None, which would make re.sub raise a TypeError.
        print("Could not decode buffer: %s" % buffer)
        return escaped



if __name__ == '__main__':
    load()
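
A caveat about the json.dump(x, file) call above (a minimal side sketch, not part of the original question): json.dump defaults to ensure_ascii=True, so every non-ASCII character that was just repaired is written back out as a \uXXXX escape. To keep the actual characters in TEST.json, pass ensure_ascii=False and open the file with an explicit encoding:

import json

data = {"xxxx1": "You don\u2019t"}           # the repaired value as a Python str

# Default ensure_ascii=True re-escapes non-ASCII in the output file:
with open("TEST.json", "w", encoding="utf-8") as file:
    json.dump(data, file)                     # writes {"xxxx1": "You don\u2019t"}

# ensure_ascii=False keeps the real characters in the file:
with open("TEST.json", "w", encoding="utf-8") as file:
    json.dump(data, file, ensure_ascii=False) # writes {"xxxx1": "You don’t"}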

Best answer

I have made this crude UTF-8 unmangler, and it seems to sort out your mixed-up encoding situation:

import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)


Usage:

broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)

converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])


It uses a regex to pick up the hex sequences from the string, converts them to individual bytes, and decodes them as UTF-8.
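
An equivalent way of looking at the same repair (a side sketch, not part of the answer's code): after a plain json.loads, each mangled sequence becomes Latin-1 characters whose code points are exactly the original UTF-8 bytes, so a Latin-1 round trip restores the intended character. This only works when the whole string is Latin-1 encodable, so mixed content such as the Japanese in the larger sample still needs the per-match regex approach above:

# Sketch: the same repair expressed as a Latin-1 round trip (only safe when
# the string is pure mojibake, i.e. every character fits into Latin-1).
mangled = "voila!\u00c2\u00a0use our \u00e2\u0082\u00ac"   # 'voila!Â\xa0use our â\x82¬'
fixed = mangled.encode("latin-1").decode("utf-8")

print(fixed)          # voila! use our €   (with a real non-breaking space)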

For your example string above (I have included a 3-byte character as a test), it outputs:

Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...


The \xa0 in "Parsed data" is caused by the way Python prints the value to the console; it is still an actual non-breaking space.
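
You can check this yourself (a tiny sketch): printing a dict goes through repr(), which escapes the non-breaking space, while printing the string directly shows the character:

s = "voila!\u00a0done"

print(s)                # voila! done            (a real non-breaking space)
print(repr(s))          # 'voila!\xa0done'       (repr escapes it)
print({"some_key": s})  # {'some_key': 'voila!\xa0done'}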

About python - file contains \u00c2\u00a0, convert to characters: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56955320/
