Problem Description
I am running a groupBy() on a dataset having several million records and want to save the resulting output (a PySpark GroupedData object) so that I can de-serialize it later and resume from that point (running aggregations on top of it as needed).
df.groupBy("geo_city")
<pyspark.sql.group.GroupedData at 0x10503c5d0>
I want to avoid converting the GroupedData object into a DataFrame or RDD in order to save it to a text file or in Parquet/Avro format (as the conversion operation is expensive). Is there some other efficient way to store the GroupedData object in some binary format for faster read/write? Possibly some equivalent of pickle in Spark?
Recommended Answer
There is none, because GroupedData is not really a thing. It doesn't perform any operations on the data at all. It only describes how the actual aggregation should proceed when you execute an action on the result of a subsequent agg.
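To see this laziness in action, here is a minimal sketch (the tiny DataFrame, its record_id column, and the n_records alias are illustrative, not from the original post): groupBy returns a GroupedData handle immediately, and Spark only runs a job once agg is combined with an action.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative DataFrame; the "geo_city" column mirrors the question.
df = spark.createDataFrame(
    [("Boston", 1), ("Boston", 2), ("Austin", 3)],
    ["geo_city", "record_id"],
)

grouped = df.groupBy("geo_city")   # returns immediately; nothing is computed yet
# <pyspark.sql.group.GroupedData at 0x...>

# Only agg() followed by an action (show/collect/write) triggers the actual work.
grouped.agg(F.count("*").alias("n_records")).show()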
You could probably serialize the underlying JVM object and restore it later, but it would be a waste of time. Since groupBy only describes what has to be done, the cost of recreating the GroupedData object from scratch should be negligible.
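A minimal sketch of what "recreating it from scratch" can look like in practice, reusing the spark, df, and F names from the snippet above (the output path is an assumption of this sketch): persist the underlying DataFrame in a binary format such as Parquet, then simply re-issue the cheap groupBy after reloading.

# Persist the DataFrame (not the GroupedData) as Parquet; the path is illustrative.
df.write.mode("overwrite").parquet("/tmp/geo_data.parquet")

# Later, possibly in a different session: reload and rebuild the grouping.
# The groupBy call only adds a node to the logical plan, so its cost is negligible.
df2 = spark.read.parquet("/tmp/geo_data.parquet")
grouped2 = df2.groupBy("geo_city")
grouped2.agg(F.count("*").alias("n_records")).show()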