Finding the min/max with PySpark in a single pass over the data

Question

I have an RDD containing a huge list of numbers (the lengths of the lines in a file), and I want to know how to get the min and max in a single pass over the data.

I know about the min and max functions, but calling both would require two passes.

Recommended answer

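Try this (in the pyspark shell, where sc is the SparkContext). A single aggregate call with a StatCounter tracks the minimum and maximum together: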

>>> from pyspark.statcounter import StatCounter
>>>
>>> rdd = sc.parallelize([9, -1, 0, 99, 0, -10])
>>> # merge folds one value into a counter; mergeStats combines two counters
>>> stats = rdd.aggregate(StatCounter(), StatCounter.merge, StatCounter.mergeStats)
>>> stats.minValue, stats.maxValue
(-10.0, 99.0)
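If only the two extremes are needed, the same single-pass idea can be sketched with a plain (min, max) tuple as the accumulator instead of a StatCounter. The names ZERO, seq_op and comb_op below are illustrative and not part of PySpark, and the two hard-coded lists stand in for RDD partitions so that the sketch runs without a cluster:

```python
from functools import reduce

# Zero value: an "empty" (min, max) pair that any real number will replace.
ZERO = (float("inf"), float("-inf"))

def seq_op(acc, value):
    """Fold one element into a partition's running (min, max) pair."""
    return (min(acc[0], value), max(acc[1], value))

def comb_op(left, right):
    """Merge the (min, max) pairs computed for two partitions."""
    return (min(left[0], right[0]), max(left[1], right[1]))

# Local stand-in for two RDD partitions (an assumption for illustration).
partitions = [[9, -1, 0], [99, 0, -10]]
partials = [reduce(seq_op, part, ZERO) for part in partitions]
lo, hi = reduce(comb_op, partials, ZERO)
print(lo, hi)  # -10 99

# With a live SparkContext, the same functions would be passed as:
# lo, hi = rdd.aggregate(ZERO, seq_op, comb_op)
```

Each element is visited once by seq_op, and comb_op only touches one pair per partition, so the data itself is read a single time. Note that an empty input leaves the result at (inf, -inf), which a caller may want to check for.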
