Problem description
I want to change this code so that it reads only lines 1400001 through 1450000. What do I need to modify? The file is composed of a single object type, with one JSON object per line. I also want to save the output to a .csv file. How should I do that?
import json

revu = []
with open("review.json", 'r', encoding="utf8") as f:
    for line in f:
        # Attempt: this slices characters within each line, not lines of the file
        revu = json.loads(line[1400001:1450000])
Recommended answer
If each line is a JSON object:
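import json

revu = []
with open("review.json", 'r', encoding="utf8") as f:
    # expensive statement: readlines() loads the whole file, so depending
    # on your file size this might make you run out of memory
    # lines 1400001..1450000 (1-based) are list indices 1400000..1449999
    revu = [json.loads(s) for s in f.readlines()[1400000:1450000]]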
If you try this on the /etc/passwd file it is easy to test (it contains no JSON, of course, so the json.loads call is left out):
revu = []
with open("/etc/passwd", 'r') as f:
    # expensive statement
    revu = [s for s in f.readlines()[5:10]]
print(revu)  # gives list indices 5 to 9, i.e. lines 6 to 10 counting from 1
Or you iterate over the lines one at a time, which avoids the memory issue:
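import json

revu = []
with open("...", 'r') as f:
    # start=1 makes i the 1-based line number
    for i, line in enumerate(f, start=1):
        if i > 1450000:
            break  # past the range, no need to read the rest
        if i >= 1400001:
            revu.append(json.loads(line))
# process revu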
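The same range can also be taken with itertools.islice from the standard library, which skips ahead lazily instead of testing every index by hand. A minimal sketch, assuming 1-based, inclusive line numbers; the helper name read_json_lines and the ten-line temporary file are made up for illustration and are not part of the original answer:

```python
import json
import os
import tempfile
from itertools import islice


def read_json_lines(path, first, last):
    """Return the parsed JSON objects on lines first..last (1-based, inclusive)."""
    with open(path, encoding="utf8") as f:
        # islice uses 0-based, end-exclusive bounds, so line `first` is
        # index first - 1 and line `last` is covered by stop=last.
        return [json.loads(line) for line in islice(f, first - 1, last)]


# Small self-check on a throwaway file with ten one-object lines.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf8") as tmp:
    for n in range(1, 11):
        tmp.write(json.dumps({"n": n}) + "\n")

rows = read_json_lines(tmp.name, 4, 6)
os.remove(tmp.name)
print(rows)  # [{'n': 4}, {'n': 5}, {'n': 6}]
```

For the file in the question the call would presumably be read_json_lines("review.json", 1400001, 1450000). The lines before the range are still read from disk, but they are neither parsed nor kept.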
Writing to CSV:
import pandas as pd
import json

def mylines(filename, _from, _to):
    # yields the parsed JSON objects on lines _from.._to (1-based, inclusive)
    with open(filename, encoding="utf8") as f:
        for i, line in enumerate(f, start=1):
            if i > _to:
                break
            if i >= _from:
                yield json.loads(line)

df = pd.DataFrame([r for r in mylines("review.json", 1400001, 1450000)])
df.to_csv("/tmp/whatever.csv")
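If pandas is not available, the standard csv module can write the rows one at a time, so the whole range never has to sit in memory. This is only a sketch: json_lines_to_csv is a hypothetical helper, the sample fields are stand-ins, and it assumes every record is a flat object with the same keys as the first one (nested values would need flattening first).

```python
import csv
import json
import os
import tempfile
from itertools import islice


def json_lines_to_csv(src, dst, first, last):
    """Write the JSON objects on lines first..last (1-based, inclusive) to a CSV file."""
    count = 0
    with open(src, encoding="utf8") as fin, \
         open(dst, "w", newline="", encoding="utf8") as fout:
        writer = None
        for line in islice(fin, first - 1, last):
            record = json.loads(line)
            if writer is None:
                # Assumption: the first record's keys define the CSV header.
                # DictWriter raises ValueError if a later record has extra keys.
                writer = csv.DictWriter(fout, fieldnames=list(record))
                writer.writeheader()
            writer.writerow(record)
            count += 1
    return count


# Small self-check on a throwaway five-line file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.json")
dst = os.path.join(workdir, "sample.csv")
with open(src, "w", encoding="utf8") as f:
    for n in range(1, 6):
        f.write(json.dumps({"review_id": f"r{n}", "stars": n}) + "\n")

written = json_lines_to_csv(src, dst, 2, 4)
with open(dst, newline="", encoding="utf8") as f:
    csv_rows = list(csv.reader(f))
print(written, csv_rows[0])
```

The pandas version is shorter and copes with records whose keys differ; this one trades that convenience for constant memory use.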