Problem description
Requirement: I have a JSON file in .gz format. Compressed, it is around ~500 MB in size; when extracted, the JSON file grows to roughly ~10 GB. The extracted file contains individual JSON objects line by line. What I want is to sort the file based on a field ps, using either a bash script or a Python program.
Because the file is too large, it's not advisable to load it into memory. So I used the gzcat and cat bash commands to stream the JSON data and then piped it to jq for sorting. But either the system becomes unresponsive during the process or I get an empty file in output.json.
>cat sth2.json | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'" > "output.json"
>gzcat sth2.json.gz | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'" > "output.json"
Hardware: 16 GB RAM, Core i5 processor
Sample JSON data:
{
"ps":"abc"
....
}
{
"ps":"def"
......
}
{
"ps":"abc"
....
}
Expected output:
{
"ps":"abc"
....
}
{
"ps":"abc"
....
}
{
"ps":"def"
....
}
I don't understand what I am doing wrong. Can anyone suggest how to sort such a huge JSON file? Links I followed: https://github.com/joelpurra/jq-hopkok/tree/master/src/parallelism
Also, is there any way I can do this via a map-reduce approach without Hadoop?
Approach 1: Stream the data into a local SQLite database.
import sqlite3
import fileinput

PATH = ".../sqlite-snapshot-201904101324/testDB.db"
insert_query = "INSERT INTO feeds (data) VALUES (?)"

def db_connect(db_path=PATH):
    con = sqlite3.connect(db_path)
    return con

con = db_connect()  # connect to the database
cur = con.cursor()  # instantiate a cursor obj

record_count = 0
for line in fileinput.input():
    cur.execute(insert_query, (line,))
    record_count += 1

con.commit()  # without a commit the inserted rows are never persisted
con.close()
Command line:
>gzcat sth.json.gz | python insert.py
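Once the rows are loaded, SQLite can perform the sort on disk rather than in memory. A minimal sketch of reading the objects back in sorted order, assuming the feeds table from the insert script above and a SQLite build with the JSON1 extension available (substitute the actual database path for testDB.db):
>sqlite3 testDB.db "SELECT data FROM feeds ORDER BY json_extract(data, '$.ps');" > output.json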
Recommended answer
Here is one solution based on the suggestion in one of the comments:
You can use jq to do this along the following lines:
jq -cr '"\(.ps)\t\(.)"'
This will produce lines with tab-separated values like so:
abc {"ps":"abc","x":0}
abc {"ps":"abc","x":1}
Using the -c option ensures that each pair (i.e. the sorting key and the object) is written on a single line.
Now you can easily sort the lines, e.g. using sort, and then use e.g. cut to strip the .ps field.
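For instance, the sort-and-cut stage might look like the following sketch, where lines.tsv is a hypothetical file holding the tab-separated lines produced by the jq step above; GNU sort spills to temporary files for large inputs, so it does not need to hold the whole 10 GB in RAM:
>sort -t $'\t' -k1,1 lines.tsv | cut -f2- > output.json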
Finally, if you really want the output to be formatted, you can again use jq (e.g. jq .), the point being that jq is by default stream-oriented.
The above assumes that the .ps values are tab-free. If that is not the case, then you could either use a different field separator, or:
jq -cr '([.ps] | @tsv) + "\t" + tostring'
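Putting the pieces together, an end-to-end pipeline over the gzipped input could look like this sketch (file names follow the question; the tab-safe key variant from just above supplies the sort key):
>gzcat sth2.json.gz | jq -cr '([.ps] | @tsv) + "\t" + tostring' | sort -t $'\t' -k1,1 | cut -f2- > output.json
Appending | jq . at the end would pretty-print the result, as noted above.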