Question
I am currently working on a project where I use Sentiment Analysis for Twitter Posts. I am classifying the Tweets with Sentiment140. With the tool I can classify up to 1,000,000 Tweets per day, and I have collected around 750,000 Tweets, so that should be fine. The only problem is that I can send a maximum of 15,000 Tweets to the JSON Bulk Classification at once.
My whole code is set up and running. The only problem is that my JSON file now contains all 750,000 Tweets.
Therefore my question: What is the best way to split the JSON into smaller files with the same structure? I would prefer to do this in Python.
I have thought about iterating through the file. But how do I specify in the code that it should create a new file after, for example, 5,000 elements?
I would love to get some hints on what the most reasonable approach is. Thank you!
This is the code that I have at the moment.
import json
from itertools import izip_longest  # Python 2; on Python 3 this is zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# Open JSON file
values = open('Tweets.json').read()
#print values

# Adjust formatting of JSON file
values = values.replace('\n', '')  # do your cleanup here
#print values

v = values.encode('utf-8')
#print v

# Load JSON file
v = json.loads(v)
print type(v)

for i, group in enumerate(grouper(v, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
The output gives:
["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]
in a file called "outputbatch_0.json".
This is the structure of the JSON:
{
    "data": [
        {
            "text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
            "id": "1"
        },
        {
            "text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u2026",
            "id": "2"
        },
        {
            "text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
            "id": "3"
        }
    ]
}
Answer
Use an iteration grouper; the itertools module recipes list includes the following:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
This lets you iterate over your tweets in groups of 5000:
for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
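A note on the ["data", null, ...] output from the question: json.loads produces a dict for the {"data": [...]} structure, and iterating over a dict yields its keys, so the grouper sees only the single key "data" and pads the rest of the 5,000-element chunk with the None fill value. The grouper therefore needs to be applied to the list under the "data" key, not to the dict itself. Below is a minimal end-to-end sketch of that fix; it assumes the Tweets.json filename and {"data": [...]} structure from the question, is written for Python 3 (where izip_longest is named zip_longest), and strips the None padding from the final chunk so that each output file keeps the same top-level structure:

import json
from itertools import zip_longest  # izip_longest on Python 2

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

with open('Tweets.json') as f:
    v = json.load(f)

# Group the tweet list itself, not the enclosing dict.
for i, group in enumerate(grouper(v['data'], 5000)):
    # Drop the None padding from the last, shorter chunk and wrap each
    # batch in the same {"data": [...]} structure as the input file.
    batch = [tweet for tweet in group if tweet is not None]
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump({'data': batch}, outputfile)

With 750,000 collected tweets this yields 150 files of 5,000 tweets each, comfortably under the 15,000-tweet limit of the JSON Bulk Classification.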