I am trying to create a nested JSON from my Spark DataFrame, which has data in the following structure. The code below creates a flat JSON with only keys and values. Could you please help?

    df.coalesce(1).write.format('json').save(data_output_file + "createjson.json", overwrite=True)

Update 1: As per @MaxU's answer, I converted the Spark DataFrame to pandas and used groupby. It is putting the last two fields in a nested array.
How could I first put the category and count in a nested array, and then inside that array put the subcategory and count?

Sample text data:

    Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
    Vendor1,10,Category 1,4,Sub Category 1,1
    Vendor1,10,Category 1,4,Sub Category 2,2
    Vendor1,10,Category 1,4,Sub Category 3,3
    Vendor1,10,Category 1,4,Sub Category 4,4

    j = (data_pd.groupby(['vendor_name','vendor_Cnt','Category','Category_cnt'], as_index=False)
         .apply(lambda x: x[['Subcategory','subcategory_cnt']].to_dict('r'))
         .reset_index()
         .rename(columns={0: 'subcategories'})
         .to_json(orient='records'))

    [{
        "vendor_name": "Vendor 1",
        "count": 10,
        "categories": [{
            "name": "Category 1",
            "count": 4,
            "subCategories": [
                {"name": "Sub Category 1", "count": 1},
                {"name": "Sub Category 2", "count": 1},
                {"name": "Sub Category 3", "count": 1},
                {"name": "Sub Category 4", "count": 1}
            ]
        }]
    }]

Solution: The easiest way to do this in Python/pandas would be to use a series of nested generators using groupby, I think:

    def split_df(df):
        for (vendor, count), df_vendor in df.groupby(["Vendor_Name", "count"]):
            yield {
                "vendor_name": vendor,
                "count": count,
                "categories": list(split_category(df_vendor)),
            }

    def split_category(df_vendor):
        for (category, count), df_category in df_vendor.groupby(
            ["Categories", "Category_Count"]
        ):
            yield {
                "name": category,
                "count": count,
                "subCategories": list(split_subcategory(df_category)),
            }

    def split_subcategory(df_category):
        # Iterate over the group passed in, not the outer df.
        for row in df_category.itertuples():
            yield {"name": row.Subcategory, "count": row.Subcategory_Count}

    list(split_df(df))

    [
        {
            "vendor_name": "Vendor1",
            "count": 10,
            "categories": [
                {
                    "name": "Category 1",
                    "count": 4,
                    "subCategories": [
                        {"name": "Sub Category 1", "count": 1},
                        {"name": "Sub Category 2", "count": 2},
                        {"name": "Sub Category 3", "count": 3},
                        {"name": "Sub Category 4", "count": 4},
                    ],
                }
            ],
        }
    ]

To export this to JSON, you'll need a way to serialize np.int64, since the counts come out as NumPy integers, which the standard json module cannot encode.
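Putting it together, here is a self-contained sketch that runs the nested generators above on the sample CSV and handles the np.int64 issue with a custom json.JSONEncoder subclass (the name NumpyEncoder is mine, not from the original answer; it assumes only pandas and NumPy are installed):

```python
import io
import json

import numpy as np
import pandas as pd

# The sample data from the question, loaded into a pandas DataFrame.
CSV = """Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4
"""
df = pd.read_csv(io.StringIO(CSV))

def split_df(df):
    # One dict per (vendor, count) pair, with categories nested inside.
    for (vendor, count), df_vendor in df.groupby(["Vendor_Name", "count"]):
        yield {"vendor_name": vendor, "count": count,
               "categories": list(split_category(df_vendor))}

def split_category(df_vendor):
    # One dict per (category, category_count) pair within a vendor.
    for (category, count), df_category in df_vendor.groupby(
            ["Categories", "Category_Count"]):
        yield {"name": category, "count": count,
               "subCategories": list(split_subcategory(df_category))}

def split_subcategory(df_category):
    # One dict per subcategory row within a category.
    for row in df_category.itertuples():
        yield {"name": row.Subcategory, "count": row.Subcategory_Count}

class NumpyEncoder(json.JSONEncoder):
    """Fall back to plain Python int/float for NumPy scalar types."""
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        return super().default(obj)

nested = list(split_df(df))
print(json.dumps(nested, cls=NumpyEncoder, indent=2))
```

A plain `json.dumps(nested)` would raise a TypeError on the np.int64 counts; routing them through the encoder's `default` hook converts them without touching the rest of the structure.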