Problem description
I have a non-standard "JSON" file to parse. Each item is separated by a semicolon instead of a comma. I can't simply replace every ; with , because some values may themselves contain a semicolon, e.g. "hello; world". How can I parse this into the same structure that a JSON parser would normally produce?
{
    "client" : "someone";
    "server" : ["s1"; "s2"];
    "timestamp" : 1000000;
    "content" : "hello; world";
    ...
}
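To make the problem concrete, here is a small sketch of what goes wrong with a plain string replacement. The sample string is made up for this illustration and is not taken from the real file.

```python
import json

# Hypothetical one-line sample in the semicolon-separated format.
sample = '{"client": "someone"; "content": "hello; world"}'

# Naive fix: replace every semicolon, including the one inside the string value.
naive = sample.replace(';', ',')

data = json.loads(naive)  # parses without error...
print(data['content'])    # hello, world   <- ...but the value was silently changed
```

The result is valid JSON, so nothing fails; the data is simply wrong, which is why the separators have to be told apart from semicolons inside strings.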
Recommended answer
Use the Python tokenize module to transform the text stream into one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input too, even with the semicolons. It presents each string as a single whole token, so a semicolon inside a string is never seen on its own, while the 'raw' separator semicolons appear in the stream as individual tokenize.OP tokens for you to replace:
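import tokenize
import json

corrected = []
with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        # token[0] is the token type, token[1] is the token text
        if token[0] == tokenize.OP and token[1] == ';':
            # a bare semicolon is a separator: emit a comma instead
            corrected.append(',')
        else:
            # everything else, including whole strings, passes through unchanged
            corrected.append(token[1])

data = json.loads(''.join(corrected))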
This assumes that the format becomes valid JSON once the semicolons have been replaced with commas; for example, no trailing comma is allowed before a closing ] or }. If the input can contain such trailing separators, you could track the last comma added and remove it again when the next non-newline token is a closing bracket or brace.
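The trailing-comma idea could be sketched roughly as follows. This is only one possible way to do it; the function name loads_semicolon_json and the variable pending are made up for this sketch, and it reads from a string through io.StringIO rather than from a file.

```python
import io
import json
import tokenize

# Token types that may sit between a comma and a closing bracket without meaning anything.
SKIPPABLE = (tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER)


def loads_semicolon_json(text):
    """Parse semicolon-separated pseudo-JSON, tolerating trailing semicolons."""
    corrected = []
    pending = None  # index in `corrected` of the most recent comma we added
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type in SKIPPABLE:
            corrected.append(tok.string)
            continue
        if tok.type == tokenize.OP and tok.string == ';':
            pending = len(corrected)
            corrected.append(',')
            continue
        if pending is not None and tok.type == tokenize.OP and tok.string in (']', '}'):
            corrected[pending] = ''  # the comma turned out to be a trailing one
        pending = None
        corrected.append(tok.string)
    if pending is not None:
        corrected[pending] = ''  # a semicolon after the final closing bracket
    return json.loads(''.join(corrected))


sample = '{\n"server" : ["s1"; "s2";];\n"content" : "hello; world";\n};\n'
result = loads_semicolon_json(sample)
```

It does not try to repair anything else, such as two semicolons in a row; those still end up as invalid JSON and make json.loads raise an error.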
Demo:
>>> import tokenize
>>> import json
>>> open('semi.json', 'w').write('''\
... {
... "client" : "someone";
... "server" : ["s1"; "s2"];
... "timestamp" : 1000000;
... "content" : "hello; world"
... }
... ''')
101
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print(''.join(corrected))
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}

>>> json.loads(''.join(corrected))
{'client': 'someone', 'server': ['s1', 's2'], 'timestamp': 1000000, 'content': 'hello; world'}
Inter-token whitespace was dropped, but it could be reinstated by paying attention to the tokenize.NL tokens and to the (lineno, start) and (lineno, end) position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.
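If the original layout does matter, one possible shortcut is to let tokenize.untokenize do the position bookkeeping instead of handling the tuples by hand. This is a different technique from the manual tracking described above, shown only as a sketch for Python 3; the function name semicolons_to_commas is made up for the illustration.

```python
import io
import json
import tokenize


def semicolons_to_commas(text):
    """Swap bare semicolons for commas while keeping the original layout."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type == tokenize.OP and tok.string == ';':
            # ',' has the same length as ';', so the recorded positions stay valid.
            tok = tok._replace(string=',')
        tokens.append(tok)
    # Given full 5-tuples, untokenize() uses the start/end positions
    # to re-insert the whitespace between tokens.
    return tokenize.untokenize(tokens)


source = '{\n    "client" : "someone";\n    "server" : ["s1"; "s2"];\n    "content" : "hello; world"\n}\n'
fixed = semicolons_to_commas(source)
data = json.loads(fixed)
```

Because each replacement token is exactly as long as the one it replaces, the positions recorded by the tokenizer remain consistent and the spacing of the input carries over to the output.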