Problem Description
I am running a groupBy() on a dataset having several million records and want to save the resulting output (a PySpark GroupedData object) so that I can de-serialize it later and resume from that point (running aggregations on top of it as needed).
df.groupBy("geo_city")
<pyspark.sql.group.GroupedData at 0x10503c5d0>
I want to avoid converting the GroupedData object into a DataFrame or RDD in order to save it to a text file or in Parquet/Avro format (as the conversion operation is expensive). Is there some other efficient way to store the GroupedData object in some binary format for faster read/write? Possibly some equivalent of pickle in Spark?
Recommended Answer
There is none, because GroupedData is not really a thing. It doesn't perform any operations on the data at all. It only describes how the actual aggregation should proceed when you execute an action on the result of a subsequent agg.
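To see this laziness in action, here is a minimal sketch (the tiny DataFrame, its record_id column, and the n_records alias are illustrative, not from the original post): groupBy returns a GroupedData handle immediately, and Spark only runs a job once agg is combined with an action.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative DataFrame; the "geo_city" column mirrors the question.
df = spark.createDataFrame(
    [("Boston", 1), ("Boston", 2), ("Austin", 3)],
    ["geo_city", "record_id"],
)

grouped = df.groupBy("geo_city")   # returns immediately; nothing is computed yet
# <pyspark.sql.group.GroupedData at 0x...>

# Only agg() followed by an action (show/collect/write) triggers the actual work.
grouped.agg(F.count("*").alias("n_records")).show()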
You could probably serialize the underlying JVM object and restore it later, but it would be a waste of time. Since groupBy only describes what has to be done, the cost of recreating the GroupedData object from scratch should be negligible.
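A minimal sketch of what "recreating it from scratch" can look like in practice, reusing the spark, df, and F names from the snippet above (the output path is an assumption of this sketch): persist the underlying DataFrame in a binary format such as Parquet, then simply re-issue the cheap groupBy after reloading.

# Persist the DataFrame (not the GroupedData) as Parquet; the path is illustrative.
df.write.mode("overwrite").parquet("/tmp/geo_data.parquet")

# Later, possibly in a different session: reload and rebuild the grouping.
# The groupBy call only adds a node to the logical plan, so its cost is negligible.
df2 = spark.read.parquet("/tmp/geo_data.parquet")
grouped2 = df2.groupBy("geo_city")
grouped2.agg(F.count("*").alias("n_records")).show()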