Problem description
Requirement: I have a JSON file in .gz format. Compressed, it is around ~500 MB in size; when extracted, the JSON file grows to roughly ~10 GB. The extracted file contains individual JSON objects line by line. What I want is to sort the file based on a field ps, using either a bash script or a Python program.
Because the file is too large, it's not advisable to load it into memory. So I used the gzcat and cat bash commands to stream the JSON data and then piped it to jq for sorting. But either the system becomes unresponsive during the process or I get an empty file in output.json.
>cat sth2.json | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'" > "output.json"
>gzcat sth2.json.gz | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'" > "output.json"
Hardware: 16 GB RAM, Core i5 processor
Sample JSON data:
{
"ps":"abc"
....
}
{
"ps":"def"
......
}
{
"ps":"abc"
....
}
Expected output:
{
"ps":"abc"
....
}
{
"ps":"abc"
....
}
{
"ps":"def"
....
}
I don't understand what I am doing wrong. Can anyone suggest how to sort such a huge JSON file? Links I followed: https://github.com/joelpurra/jq-hopkok/tree/master/src/parallelism
Also, is there any way I can do this via a map-reduce approach without Hadoop?
Approach 1: Stream the data into a local SQLite database.
import sqlite3
import fileinput

PATH = ".../sqlite-snapshot-201904101324/testDB.db"
insert_query = "INSERT INTO feeds (data) VALUES (?)"

def db_connect(db_path=PATH):
    con = sqlite3.connect(db_path)
    return con

con = db_connect()  # connect to the database
cur = con.cursor()  # instantiate a cursor obj

record_count = 0
for line in fileinput.input():
    cur.execute(insert_query, (line,))
    record_count += 1

con.commit()  # without a commit the inserted rows are never persisted
con.close()
Command line:
>gzcat sth.json.gz | python insert.py
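Once the rows are loaded, SQLite can perform the sort on disk rather than in memory. A minimal sketch of reading the objects back in sorted order, assuming the feeds table from the insert script above and a SQLite build with the JSON1 extension available (substitute the actual database path for testDB.db):
>sqlite3 testDB.db "SELECT data FROM feeds ORDER BY json_extract(data, '$.ps');" > output.json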
Recommended answer
Here is one solution based on the suggestion in one of the comments:
You can use jq to do this along the following lines:
jq -cr '"\(.ps)\t\(.)"'
This will produce lines with tab-separated values like so:
abc {"ps":"abc","x":0}
abc {"ps":"abc","x":1}
Using the -c option ensures that each pair (i.e. the sorting key and the object) is written on a single line.
Now you can easily sort the lines, e.g. using sort, and then use e.g. cut to strip the .ps field.
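For instance, the sort-and-cut stage might look like the following sketch, where lines.tsv is a hypothetical file holding the tab-separated lines produced by the jq step above; GNU sort spills to temporary files for large inputs, so it does not need to hold the whole 10 GB in RAM:
>sort -t $'\t' -k1,1 lines.tsv | cut -f2- > output.json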
Finally, if you really want the output to be formatted, you can again use jq (e.g. jq .), the point being that jq is by default stream-oriented.
The above assumes that the .ps values are tab-free. If that is not the case, then you could either use a different field separator, or:
jq -cr '([.ps] | @tsv) + "\t" + tostring'
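Putting the pieces together, an end-to-end pipeline over the gzipped input could look like this sketch (file names follow the question; the tab-safe key variant from just above supplies the sort key):
>gzcat sth2.json.gz | jq -cr '([.ps] | @tsv) + "\t" + tostring' | sort -t $'\t' -k1,1 | cut -f2- > output.json
Appending | jq . at the end would pretty-print the result, as noted above.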