Question
I'm trying to import a large zipped JSON file from Amazon S3 into AWS RDS PostgreSQL using Python, but it fails with this error:
Traceback (most recent call last):
  File "my_code.py", line 64, in <module>
    file_content = f.read().decode('utf-8').splitlines(True)
  File "/usr/lib64/python3.6/zipfile.py", line 835, in read
    buf += self._read1(self.MAX_N)
  File "/usr/lib64/python3.6/zipfile.py", line 925, in _read1
    data = self._decompressor.decompress(data, n)
MemoryError
# my_code.py
import sys
import boto3
import psycopg2
import zipfile
import io
import json
import config
s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
connection = psycopg2.connect(host=<host>, dbname=<dbname>, user=<user>, password=<password>)
cursor = connection.cursor()
bucket = sys.argv[1]
key = sys.argv[2]
obj = s3.get_object(Bucket=bucket, Key=key)
def insert_query():
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    cursor.execute(query, (json.dumps(data),))
if key.endswith('.zip'):
    zip_files = obj['Body'].read()
    with io.BytesIO(zip_files) as zf:
        zf.seek(0)
        with zipfile.ZipFile(zf, mode='r') as z:
            for filename in z.namelist():
                with z.open(filename) as f:
                    file_content = f.read().decode('utf-8').splitlines(True)
                    for row in file_content:
                        data = json.loads(row)
                        insert_query()
if key.endswith('.json'):
    file_content = obj['Body'].read().decode('utf-8').splitlines(True)
    for row in file_content:
        data = json.loads(row)
        insert_query()
connection.commit()
connection.close()
Is there any way to solve this? Any help would be appreciated, thank you so much!
Answer
The problem is that you read an entire file into memory at once, which can run out of memory if the file is too large.
You should read the file one line at a time, and since each line in the file is apparently a JSON string, you can process each line directly in the loop:
with z.open(filename) as f:
    for line in f:
        insert_query(json.loads(line.decode('utf-8')))
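The same line-at-a-time idea can also be applied to the plain .json branch of your script, so the whole object never has to be read into memory. A minimal sketch, assuming your boto3/botocore version exposes StreamingBody.iter_lines():

if key.endswith('.json'):
    # iter_lines() streams the S3 object and yields one line (as bytes) at a time,
    # instead of reading the entire body with .read().
    # (Assumes a botocore version that provides StreamingBody.iter_lines().)
    for line in obj['Body'].iter_lines():
        if line:  # skip blank lines
            insert_query(json.loads(line.decode('utf-8')))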
Your insert_query function should accept data as a parameter, by the way:
def insert_query(data):
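Combined with the query from your question, the revised function would look roughly like this (same SQL as before, only data is now passed in rather than read from a global variable):

def insert_query(data):
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    # The JSON document for this row is passed as the single query parameter.
    cursor.execute(query, (json.dumps(data),))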