This article looks at the question of how to serialize a PySpark GroupedData object; the answer below may be a useful reference if you run into the same problem.

Question

I am running groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can de-serialize it later and resume from that point (running aggregations on top of it as needed).

df.groupBy("geo_city")
<pyspark.sql.group.GroupedData at 0x10503c5d0>

I want to avoid converting the GroupedData object into DataFrames or RDDs in order to save it to a text file or in Parquet/Avro format (since the conversion operation is expensive). Is there some other efficient way to store the GroupedData object in some binary format for faster reads/writes? Possibly some equivalent of pickle in Spark?

Answer

There is none, because GroupedData is not really a thing: it doesn't perform any operations on the data at all. It only describes how the actual aggregation should proceed when you execute an action on the result of a subsequent agg.
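
To illustrate the point, here is a minimal sketch (the DataFrame, with its geo_city and population columns, is hypothetical example data, not from the original question): the groupBy call returns instantly because nothing is computed until an aggregation is defined and an action is executed on it.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data; only the column names matter here.
df = spark.createDataFrame(
    [("Boston", 10), ("Boston", 20), ("Austin", 5)],
    ["geo_city", "population"],
)

grouped = df.groupBy("geo_city")   # returns immediately: just a GroupedData description
result = grouped.agg(F.sum("population").alias("total_population"))  # still lazy: a DataFrame plan
result.show()                      # only now does Spark actually scan and aggregate the data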

You could probably serialize the underlying JVM object and restore it later, but it is a waste of time. Since groupBy only describes what has to be done, the cost of recreating the GroupedData object from scratch should be negligible.
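
In practice, then, a reasonable pattern (not from the original answer, just a sketch continuing the example above, under the assumption that the expensive part is producing the input data rather than the grouping itself) is to persist the DataFrame, for example as Parquet, and simply re-issue the cheap groupBy when you resume. The file path below is hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Save the (expensive-to-produce) input DataFrame, not the GroupedData object.
df.write.mode("overwrite").parquet("/tmp/geo_data.parquet")  # hypothetical path

# Later, possibly in another session: reload the data and rebuild the grouping for free.
df2 = spark.read.parquet("/tmp/geo_data.parquet")
grouped2 = df2.groupBy("geo_city")                  # negligible cost: just a description
agg2 = grouped2.agg(F.count("*").alias("num_rows")) # run whatever aggregations you need
agg2.show()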

That concludes this look at how to serialize a PySpark GroupedData object; hopefully the answer above is helpful.
