Problem Description
I tested writing with:
df.write.partitionBy("id", "name")
  .mode(SaveMode.Append)
  .parquet(filePath)
However, if I leave out the partitioning:
df.write
  .mode(SaveMode.Append)
  .parquet(filePath)
it executes 100x(!) faster.
Is it normal for the same amount of data to take 100x longer to write when partitioning?
There are 10 and 3000 unique id and name column values respectively. The DataFrame has 10 additional integer columns.
Accepted Answer
The first code snippet writes a Parquet file per partition to the file system (local or HDFS). This means that if you have 10 distinct ids and 3000 distinct names, this code will create 30000 files. I suspect the overhead of creating the files, writing the Parquet metadata, etc. is quite large (in addition to the shuffling).
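If you want to keep the partitioned layout but avoid a flood of tiny files, one common mitigation is to repartition the DataFrame by the same columns before writing, so each (id, name) combination is handled by a single task and produces a single file. A minimal sketch, assuming the same df, filePath, and column names as above:

import org.apache.spark.sql.SaveMode

// Repartition by the partition columns so all rows for a given
// (id, name) combination land in one task, yielding roughly one
// Parquet file per output directory instead of one per task.
df.repartition(df("id"), df("name"))
  .write
  .partitionBy("id", "name")
  .mode(SaveMode.Append)
  .parquet(filePath)

This trades some extra shuffle work for far fewer output files, which is usually the better deal when the partition columns have many distinct values.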
Spark is not the best database engine. If your dataset fits in memory, I suggest using a relational database instead; it will be faster and easier to work with.
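As an illustration of that suggestion (not part of the original answer), the same DataFrame could be appended to a relational table through Spark's JDBC sink. The connection URL, table name, and credentials below are placeholders:

import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "dbuser")      // placeholder credentials
props.setProperty("password", "dbpass")

// Append the DataFrame into a relational table; the JDBC URL and
// table name are hypothetical examples.
df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:postgresql://localhost:5432/mydb", "my_table", props)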