Problem description
I have a multi-gigabyte JSON file. The file is made up of JSON objects of no more than a few thousand characters each, but there are no line breaks between the records.
Using Python 3 and the json module, how can I read one JSON object at a time from the file into memory?
The data is in a plain text file. Here is an example of a similar record. The actual records contain many nested dictionaries and lists.
Record in readable format:
{
    "results": {
        "__metadata": {
            "type": "DataServiceProviderDemo.Address"
        },
        "Street": "NE 228th",
        "City": "Sammamish",
        "State": "WA",
        "ZipCode": "98074",
        "Country": "USA"
    }
}
Actual format. New records start one after the other without any breaks.
{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }
Recommended answer
Generally speaking, putting more than one JSON object into a file makes that file invalid, broken JSON. That said, you can still parse the data in chunks using the JSONDecoder.raw_decode() method.
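As a quick illustration of what raw_decode() does, here is a minimal sketch on a short, made-up string holding two back-to-back objects (the string and variable names are only for illustration). The method parses one JSON value and returns it together with the index at which parsing stopped, which is what makes it possible to continue with the next object.

```python
from json import JSONDecoder

decoder = JSONDecoder()
text = '{"a": 1}{"b": [2, 3]}'  # two objects with nothing in between

# raw_decode parses a single JSON value and returns a tuple of
# (parsed value, index in the string where that value ended).
first, end = decoder.raw_decode(text)

# The optional second argument is the index to start parsing from.
second, end2 = decoder.raw_decode(text, end)

print(first, end)
print(second, end2)
```

Note that json.loads() would reject the same string, because it expects exactly one JSON value with nothing after it.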
The following will yield complete objects as the parser finds them:
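from json import JSONDecoder
from functools import partial


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    # Call fileobj.read(buffersize) repeatedly until it returns '' (end of file)
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                # Parse one object from the start of the buffer
                result, index = decoder.raw_decode(buffer)
                yield result
                # Drop the parsed text plus any whitespace before the next object
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode, read more
                break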
This function reads data from the given file object in chunks of buffersize characters and has the decoder object parse whole JSON objects from the buffer. Each parsed object is yielded to the caller.
Use it like this:
with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        # process object (print() is only a placeholder)
        print(data)
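To try the generator without a multi-gigabyte file, the sketch below feeds it an in-memory io.StringIO and a deliberately tiny buffersize, so every record spans several reads. The sample records are made up for illustration, and the function body is repeated (with the yield moved out of the try block, which should not change behavior) only so that the snippet runs on its own.

```python
import io
from functools import partial
from json import JSONDecoder


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
            except ValueError:
                # Incomplete object in the buffer: read another chunk
                break
            yield result
            buffer = buffer[index:].lstrip()


# Made-up sample data: three objects written back-to-back.
sample = '{"id": 1, "tags": ["a", "b"]}{"id": 2, "tags": []}{"id": 3}'

# buffersize=8 forces each object to be assembled from several reads.
records = list(json_parse(io.StringIO(sample), buffersize=8))
print(records)
```

One thing to keep in mind: if the file ends with an incomplete or malformed object, the leftover buffer is silently discarded rather than reported.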
Use this approach only if your JSON objects are written to the file back-to-back, with no newlines in between. If you do have newlines, and each JSON object is limited to a single line, you have a JSON Lines document, in which case you can use the approach described in "Loading and parsing a JSON file with multiple JSON objects in Python" instead.
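For the JSON Lines case, a minimal sketch could look like the following. The helper name read_json_lines and the sample data are hypothetical, and it assumes each line holds exactly one complete JSON value.

```python
import io
import json


def read_json_lines(fileobj):
    """Yield one parsed object per non-empty line (hypothetical helper)."""
    for line in fileobj:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# In-memory stand-in for a file opened with open(path, 'r'); made-up data.
sample = io.StringIO('{"City": "Sammamish"}\n{"City": "Redmond"}\n')
cities = [record["City"] for record in read_json_lines(sample)]
print(cities)
```

Because iterating over a file object reads one line at a time, this also keeps only a single record in memory at once.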