Problem description
I'm using Spark SQL to run a query over my dataset. The result of the query is pretty small, but it is still partitioned.
I would like to coalesce the resulting DataFrame and order the rows by a column. I tried
DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");
I also tried
DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");
The output file is ordered in chunks (i.e. the partitions are sorted internally, but the DataFrame is not ordered as a whole). For example, instead of
1, value
2, value
4, value
4, value
5, value
5, value
...
I get
2, value
4, value
5, value
-----------> partition boundary
1, value
4, value
5, value
- What is the right way to get an absolute ordering of the query result?
- Why hasn't the DataFrame been coalesced into a single partition?
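The chunked output above is exactly what independent per-partition sorting produces. A small pure-Python simulation (the partition contents are made up for illustration):

```python
# Hypothetical partition contents; each partition is sorted on its own,
# the way a lost global sort would leave them.
partitions = [[2, 4, 5], [1, 4, 5]]

# Sorting each partition independently and concatenating them...
chunked = [v for part in partitions for v in sorted(part)]
print(chunked)       # chunk-sorted: [2, 4, 5, 1, 4, 5]

# ...is not the same as a single global sort over all rows.
global_order = sorted(v for part in partitions for v in part)
print(global_order)  # globally sorted: [1, 2, 4, 4, 5, 5]
```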
Answer
I want to mention a couple of things here.

1- The source code shows that the orderBy statement internally calls the sort API with global ordering set to true. So the lack of ordering in the output suggests that the ordering was lost while writing to the target. My point is that a call to orderBy always requests a global order.
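For context, Spark's global sort does not require a single partition: it range-partitions the rows first and then sorts within each partition, so the partitions themselves come out in order. A pure-Python sketch of the idea (the boundary and data are made up; this is not Spark's actual code, which samples the data to pick boundaries):

```python
data = [5, 1, 4, 2, 5, 4]

# Hypothetical range boundary, e.g. derived from sampling the data:
# partition 0 gets x < 3, partition 1 gets x >= 3.
boundary = 3

# Range-partition, then sort within each partition.
parts = [[], []]
for x in data:
    parts[0 if x < boundary else 1].append(x)
parts = [sorted(p) for p in parts]
print(parts)  # [[1, 2], [4, 4, 5, 5]]

# Reading the partitions back in index order yields the global order.
assert parts[0] + parts[1] == sorted(data)
```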
2- Using a drastic coalesce, as in forcing a single partition in your case, can be really dangerous. I would recommend you not do that. The source code suggests that calling coalesce(1) can cause upstream transformations to run in a single partition, which would be brutal performance-wise.
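To make the distinction concrete, here is a pure-Python analogy for the data movement of the two operations (it only illustrates row movement, not scheduling, which is where the real danger of coalesce(1) lies: no shuffle boundary is inserted, so upstream work can be pulled into that single task):

```python
partitions = [[5], [2], [8], [1]]

# coalesce: glue existing partitions together; individual rows are not shuffled.
def coalesce(parts, n):
    out = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i * n // len(parts)].extend(p)
    return out

# repartition: full shuffle; rows are redistributed one by one (round-robin here).
def repartition(parts, n):
    out = [[] for _ in range(n)]
    for i, x in enumerate(v for p in parts for v in p):
        out[i % n].append(x)
    return out

print(coalesce(partitions, 2))     # [[5, 2], [8, 1]]
print(repartition(partitions, 2))  # [[5, 8], [2, 1]]
```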
3- You seem to expect the orderBy statement to be executed with a single partition. I don't think I agree with that; it would make Spark a really silly distributed framework.
Community, please let me know if you agree or disagree with these statements.
How are you collecting data from the output, anyway?
Maybe the output actually contains sorted data, but the transformations/actions you performed in order to read from the output are responsible for the lost order.
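For example (a pure-Python sketch): a globally sorted result written out as multiple part files is only in order if the files are read back in part-number order; listing them in an arbitrary order reorders the chunks.

```python
# A globally sorted result split across part files, in the style Spark writes it.
part_files = {
    "part-00000": [1, 2],
    "part-00001": [4, 4],
    "part-00002": [5, 5],
}

# Reading in part-number order preserves the global sort...
in_order = [v for name in sorted(part_files) for v in part_files[name]]
assert in_order == sorted(in_order)

# ...but reading in an arbitrary listing order does not.
arbitrary = ["part-00002", "part-00000", "part-00001"]
out_of_order = [v for name in arbitrary for v in part_files[name]]
print(out_of_order)  # [5, 5, 1, 2, 4, 4] -- chunks intact, global order lost
```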