Problem Description
I want to access the first 100 rows of a spark data frame and write the result back to a CSV file.
Why is take(100) basically instant, whereas
df.limit(100)
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")
takes forever? I do not want to obtain the first 100 records per partition, but just any 100 records.
Recommended Answer
This is because predicate pushdown is currently not supported in Spark; see this very good answer.
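You can see the difference in the physical plans yourself. The following is a rough sketch, assuming an existing DataFrame named df; the exact operator names vary between Spark versions:

// When the limit is the final operator before results are returned,
// Spark plans it as CollectLimit, which can stop scanning input
// partitions as soon as 100 rows have been produced.
df.limit(100).explain()

// When the limit sits below other operators (here a repartition
// feeding a write), it is planned as LocalLimit/GlobalLimit with a
// shuffle: every input partition is scanned for its local 100 rows,
// which are then shuffled to a single partition for the global limit.
df.limit(100).repartition(1).explain()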
Actually, take(n) should take a really long time as well. However, I just tested it and got the same results as you did: take is almost instantaneous regardless of the dataset size, while limit takes a lot of time.
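If the goal is simply to get any 100 rows into a CSV, one workaround (a sketch, not from the original answer; it assumes an existing SparkSession named spark and DataFrame named df) is to collect the rows via the fast take path and rebuild a small DataFrame for the write:

import org.apache.spark.sql.SaveMode

// Collect 100 rows on the driver using the fast, early-exit take path.
val first100 = df.take(100)

// Rebuild a one-partition DataFrame from the collected rows,
// reusing the original schema.
val small = spark.createDataFrame(
  spark.sparkContext.parallelize(first100.toSeq, numSlices = 1),
  df.schema
)

// Writing 100 in-memory rows is cheap, so this finishes quickly.
small.write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")

The caveat is that the 100 rows pass through the driver, which is harmless for an n this small but would not scale to large limits.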