This article covers how to repair invalid octal escapes in JSON.

Problem description

KISSmetrics generates invalid JSON strings that I need to parse. I'm getting tons of errors like:

ERROR 2013-03-04 04:31:12,253 Invalid \escape: line 1 column 132 (char 132): {"search engine":"Google","_n":"search engine hit","_p":"z392cpdpnm6silblq5mac8kiugq=","search terms":"happy new year animation 1920\303\2271080 hd","_t":1356390128}

ERROR 2013-03-04 04:34:19,153 Invalid \escape: line 1 column 101 (char 101): {"search engine":"Google","_n":"ad campaign hit","_p":"byskpczsw6sorbmzqi0tk1uimgw=","search terms":"\331\203\330\261\330\252\331\207 \331\201\331\212\330\257\331\212\330\244\331\211 \330\256\331\212\331\204\330\247\330\255\331\211 \331\203\331\210\330\261\330\257\331\211","_t":1356483052}

My code is:

for line in lines:
    try:
        data = self.clean_data(json.loads(line))
    except ValueError as e:
        # Log the parser's message together with the offending line
        logger.error('%s: %s', e, line)

Example raw data:

{"search engine":"Google","_n":"search engine hit","_p":"kvceh84hzbhywcnlivv+hdztizw=","search terms":"military sound effects programs","_t":1356034177}

Is there any chance to clean up this messy JSON and parse it? Thanks for your help.

Solution

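Your input data contains octal escapes (for example \303\227), which JSON does not allow: the only backslash escapes it accepts are \", \\, \/, \b, \f, \n, \r, \t and \uXXXX. Replace each octal escape with the byte it encodes, using a regular expression: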

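import re

# A backslash followed by 1 to 3 octal digits, enough for byte values up to 0xFF (\377)
invalid_escape = re.compile(r'\\[0-7]{1,3}')

def replace_with_byte(match):
    # Drop the leading backslash and parse the digits as base 8.
    # On Python 2, chr() returns a one-byte str, so UTF-8 sequences are rebuilt byte by byte.
    return chr(int(match.group(0)[1:], 8))

def repair(brokenjson):
    # Substitute every octal escape in the raw line with its decoded byte
    return invalid_escape.sub(replace_with_byte, brokenjson)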

This makes your input work:

>>> import json
>>> data1 = r"""{"search engine":"Google","_n":"search engine hit","_p":"z392cpdpnm6silblq5mac8kiugq=","search terms":"happy new year animation 1920\303\2271080 hd","_t":1356390128}"""
>>> json.loads(repair(data1))
{u'_n': u'search engine hit', u'search terms': u'happy new year animation 1920\xd71080 hd', u'_p': u'z392cpdpnm6silblq5mac8kiugq=', u'_t': 1356390128, u'search engine': u'Google'}
>>> print json.loads(repair(data1))['search terms']
happy new year animation 1920×1080 hd
>>> data2 = r"""{"search engine":"Google","_n":"ad campaign hit","_p":"byskpczsw6sorbmzqi0tk1uimgw=","search terms":"\331\203\330\261\330\252\331\207 \331\201\331\212\330\257\331\212\330\244\331\211 \330\256\331\212\331\204\330\247\330\255\331\211 \331\203\331\210\330\261\330\257\331\211","_t":1356483052}"""
>>> json.loads(repair(data2))
{u'_n': u'ad campaign hit', u'search terms': u'\u0643\u0631\u062a\u0647 \u0641\u064a\u062f\u064a\u0624\u0649 \u062e\u064a\u0644\u0627\u062d\u0649 \u0643\u0648\u0631\u062f\u0649', u'_p': u'byskpczsw6sorbmzqi0tk1uimgw=', u'_t': 1356483052, u'search engine': u'Google'}
>>> print json.loads(repair(data2))['search terms']
كرته فيديؤى خيلاحى كوردى
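
The session above is Python 2, where chr() yields a single byte. On Python 3, chr() returns a Unicode code point instead, so the same function would turn \303\227 into two separate characters rather than ×. The following is a minimal sketch of one way to port the idea, not the answer's own code: it does the substitution on bytes and decodes the result as UTF-8, assuming the escapes really are UTF-8 bytes as in the samples. The names _ESCAPE, _replace and parse_line are made up for illustration, and the two extra behaviours (skipping an escaped backslash, and writing ASCII values as \u escapes so that a decoded quote or control character cannot break the string) are additions beyond the original answer.

```python
import json
import re

# Either an escaped backslash (valid JSON, left untouched) or a backslash
# followed by an octal number no larger than 377 (0xFF).
_ESCAPE = re.compile(rb'\\\\|\\([0-3]?[0-7]{1,2})')


def _replace(match):
    digits = match.group(1)
    if digits is None:
        # Matched r'\\': already a valid JSON escape, keep it as-is.
        return match.group(0)
    value = int(digits, 8)
    if value < 0x80:
        # ASCII value: emit a JSON \u escape so quotes, backslashes and
        # control characters cannot corrupt the surrounding string.
        return b'\\u%04x' % value
    # Non-ASCII value: one byte of a multi-byte UTF-8 sequence.
    return bytes([value])


def repair(broken_json):
    """Return broken_json (str or bytes) as text with octal escapes decoded."""
    if isinstance(broken_json, str):
        raw = broken_json.encode('utf-8')
    else:
        raw = broken_json
    return _ESCAPE.sub(_replace, raw).decode('utf-8')


def parse_line(line):
    """Parse a line as JSON, repairing octal escapes only when needed."""
    try:
        return json.loads(line)
    except ValueError:
        return json.loads(repair(line))


sample = r'{"search terms":"happy new year animation 1920\303\2271080 hd","_t":1356390128}'
print(parse_line(sample)['search terms'])
# prints: happy new year animation 1920×1080 hd
```

Because both json.JSONDecodeError and UnicodeDecodeError are subclasses of ValueError, a loop like the one in the question can call parse_line(line) in place of json.loads(line) and keep its existing except ValueError clause for lines that still cannot be repaired.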
