Finding the minimum and maximum dates in a PySpark RDD


Problem description

I am using Spark with IPython and have an RDD that contains data in this format when printed:

print(rdd1.collect())

[u'2010-12-08 00:00:00', u'2010-12-18 01:20:00', u'2012-05-13 00:00:00',....]

Each element is a timestamp string, and I want to find the minimum and the maximum in this RDD. How can I do that?

Recommended answer

For example, you can use the aggregate function (for an explanation of how it works, see: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?)

from datetime import datetime    

rdd  = sc.parallelize([
    u'2010-12-08 00:00:00', u'2010-12-18 01:20:00', u'2012-05-13 00:00:00'])

def seq_op(acc, x):
    """ Given a tuple (min-so-far, max-so-far) and a date string
    return a tuple (min-including-current, max-including-current)
    """
    d = datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    return (min(d, acc[0]), max(d, acc[1]))

def comb_op(acc1, acc2):
    """ Given a pair of tuples (min-so-far, max-so-far)
    return a tuple (min-of-mins, max-of-maxs)
    """
    return (min(acc1[0], acc2[0]), max(acc1[1], acc2[1]))

# (initial-min <- max-date, initial-max <- min-date)
rdd.aggregate((datetime.max, datetime.min), seq_op, comb_op)

## (datetime.datetime(2010, 12, 8, 0, 0), datetime.datetime(2012, 5, 13, 0, 0))
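To make the roles of the two functions concrete, here is a minimal sketch that mimics what aggregate does, without needing a Spark cluster. The helper simulate_aggregate and the way the data is split into partitions are assumptions made for illustration; they are not part of the PySpark API. The idea is that seq_op folds each partition starting from the zero value, and comb_op then merges the per-partition results.

```python
from datetime import datetime
from functools import reduce

FMT = '%Y-%m-%d %H:%M:%S'

def seq_op(acc, x):
    # Fold one date string into a (min-so-far, max-so-far) tuple
    d = datetime.strptime(x, FMT)
    return (min(d, acc[0]), max(d, acc[1]))

def comb_op(acc1, acc2):
    # Merge two (min, max) tuples coming from different partitions
    return (min(acc1[0], acc2[0]), max(acc1[1], acc2[1]))

def simulate_aggregate(partitions, zero, seq_op, comb_op):
    # Hypothetical helper: partitions is a plain list of lists
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)

# Assumed partitioning of the sample data; the last partition is empty
partitions = [
    ['2010-12-08 00:00:00', '2010-12-18 01:20:00'],
    ['2012-05-13 00:00:00'],
    [],
]

zero = (datetime.max, datetime.min)
result = simulate_aggregate(partitions, zero, seq_op, comb_op)
print(result)
# (datetime.datetime(2010, 12, 8, 0, 0), datetime.datetime(2012, 5, 13, 0, 0))
```

The empty partition shows why the zero value is (datetime.max, datetime.min): it is neutral under min and max, so a partition with no data does not affect the final answer.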

Or with DataFrames:

from pyspark.sql import Row
from pyspark.sql import functions as F  # alias avoids shadowing the built-in min and max

row = Row("ts")
df = rdd.map(row).toDF()

# Convert the strings to Unix seconds, aggregate, then format back as timestamp strings
df.withColumn("ts", F.unix_timestamp("ts")).agg(
    F.from_unixtime(F.min("ts")).alias("min_ts"),
    F.from_unixtime(F.max("ts")).alias("max_ts")
).show()

## +-------------------+-------------------+
## |             min_ts|             max_ts|
## +-------------------+-------------------+
## |2010-12-08 00:00:00|2012-05-13 00:00:00|
## +-------------------+-------------------+
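A possible shortcut, assuming every string uses the same zero-padded 'YYYY-MM-DD HH:MM:SS' layout shown in the question: in that layout, alphabetical order and chronological order coincide, so the strings can be compared directly without parsing. The plain-Python sketch below checks this on the sample values; the names stamps and parse are illustrative only.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

# Sample values from the question, deliberately out of order
stamps = ['2012-05-13 00:00:00', '2010-12-08 00:00:00', '2010-12-18 01:20:00']

def parse(s):
    # Turn a timestamp string into a datetime for comparison
    return datetime.strptime(s, FMT)

# Compare as plain strings
earliest = min(stamps)
latest = max(stamps)

# Compare as parsed datetimes and confirm both approaches agree
assert parse(earliest) == min(parse(s) for s in stamps)
assert parse(latest) == max(parse(s) for s in stamps)

print(earliest, latest)
# 2010-12-08 00:00:00 2012-05-13 00:00:00
```

Under that same assumption, calling rdd.min() and rdd.max() on the RDD of strings should give the same two values. If the formats are mixed or not zero-padded, parsing first (as in the aggregate example) is the safer route.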
