I am trying to compute the session duration for each user ID with PySpark. A sample of the data looks like this:

df_session.show(8,False):

|userid|platform            |previousTime           |currentTime            |timeDifference |
|1234  |13                  |null                   |2017-07-20 10:49:30.027|null           |
|1234  |13                  |null                   |2017-07-20 10:04:23.1  |null           |
|1234  |13                  |2017-07-20 10:04:23.1  |2017-07-20 10:06:23.897|120            |
|1234  |13                  |2017-07-20 10:04:23.897|2017-07-20 10:40:29.472|2166           |
|1234  |13                  |2017-07-20 10:40:29.472|2017-07-20 10:40:50.347|11             |
|1234  |13                  |2017-07-20 10:40:30.347|2017-07-20 10:51:16.458|646            |
|1234  |13                  |2017-07-20 10:51:16.458|2017-07-20 10:51:17.427|1              |
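
The question does not show how previousTime and timeDifference were derived; for context only, here is a minimal sketch of one way such a frame could be built from raw per-event timestamps with a window function. The events DataFrame and its eventTime column are hypothetical, not part of the question:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    # hypothetical raw events: one row per (userid, platform, eventTime)
    events = spark.createDataFrame(
        [('1234', '13', '2017-07-20 10:04:23.1'),
         ('1234', '13', '2017-07-20 10:06:23.897'),
         ('1234', '13', '2017-07-20 10:40:29.472')],
        ['userid', 'platform', 'eventTime'])

    # previous event time per (userid, platform), ordered by event time
    w = Window.partitionBy('userid', 'platform').orderBy('eventTime')
    df_session = (events
        .withColumn('previousTime', F.lag('eventTime').over(w))
        .withColumnRenamed('eventTime', 'currentTime')
        .withColumn('timeDifference',
                    F.col('currentTime').cast('timestamp').cast('long')
                    - F.col('previousTime').cast('timestamp').cast('long')))
    df_session.show(truncate=False)
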
  • I want to group by userid and platform
  • Then, within each group, I want to set currentTime == previousTime whenever timeDifference > 2000 or timeDifference is null, so I tried the following:
    from pyspark.sql import SQLContext, functions

    df_session.select(df_session.userid, df_session.platform,
                      functions.when(df_session.timeDifference > 2000, df_session.previousTime).otherwise(df_session.currentTime))

    df_session.select(df_session.userid, df_session.platform,
                      functions.when(df_session.timeDifference.isNull(), df_session.currentTime).otherwise(df_session.previousTime))
    
  • Then I want to add up all the timeDifference values that are smaller than 2000 and advance currentTime by that total time difference, so the result would look like this:
    |userid|platform            |previousTime           |currentTime            |timeDifference |
    |1234  |13                  |2017-07-20 10:49:30.027|2017-07-20 10:49:30.027|0              |
    |1234  |13                  |2017-07-20 10:04:23.1  |2017-07-20 10:04:23.1  |0              |
    |1234  |13                  |2017-07-20 10:04:23.1  |2017-07-20 10:06:23.897|120            |
    |1234  |13                  |2017-07-20 10:04:23.897|2017-07-20 10:04:23.897|0              |
    |1234  |13                  |2017-07-20 10:40:29.472|2017-07-20 10:51:17.427|658            |
    

  • The last part is the really tricky bit and I don't know where to start; a rough sketch of the idea follows below. Thanks.
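
The core of that last step is a groupBy aggregation plus adding the summed seconds back onto a timestamp. A minimal sketch of just that part, assuming df_session is the DataFrame shown above with timeDifference in seconds (millisecond precision is dropped by the casts; the accepted answer below handles the full breakdown into parts):

    from pyspark.sql import functions as F

    # keep only the small gaps, sum them per group, and shift the earliest
    # previousTime forward by the total to get the aggregated currentTime
    agg = (df_session
        .where('timeDifference < 2000')
        .groupBy('userid', 'platform')
        .agg(F.min('previousTime').alias('previousTime'),
             F.sum('timeDifference').alias('timeDifference'))
        .withColumn('currentTime',
                    (F.col('previousTime').cast('timestamp').cast('long')
                     + F.col('timeDifference')).cast('timestamp')))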

    Best answer

    Hope this helps!

    import pyspark.sql.functions as func
    from functools import reduce  # reduce is not a builtin on Python 3
    from datetime import datetime, timedelta
    from pyspark.sql.types import StringType
    
    df = sc.parallelize([('1234','13',None,'2017-07-20 10:49:30.027',None),
                        ('1234','13',None,'2017-07-20 10:04:23.100',None),
                        ('1234','13','2017-07-20 10:04:23.100','2017-07-20 10:06:23.897',120),
                        ('1234','13','2017-07-20 10:04:23.897','2017-07-20 10:40:29.472',2166),
                        ('1234','13','2017-07-20 10:40:29.472','2017-07-20 10:40:50.347',11),
                        ('1234','13','2017-07-20 10:40:30.347','2017-07-20 10:51:16.458',646),
                        ('1234','13','2017-07-20 10:51:16.458','2017-07-20 10:51:17.427',1),
                        ('7777','44','2017-07-20 10:31:16.458','2017-07-20 10:47:16.458',1000),
                        ('7777','44','2017-07-20 11:11:16.458','2017-07-20 11:36:16.458',1500),
                        ('678','56','2017-07-20 10:51:16.458','2017-07-20 10:51:36.458',20),
                        ('678','56','2017-07-20 10:51:16.458','2017-07-20 10:51:26.458',10)
                        ]).\
        toDF(['userid','platform','previousTime','currentTime','timeDifference'])
    df.show()
    
    # missing value & outlier treatment
    df1 = df.select("userid","platform", func.when(df.timeDifference.isNull(), df.currentTime).otherwise(df.previousTime),
                    func.when(df.timeDifference > 2000, df.previousTime).otherwise(df.currentTime),
                    func.when(df.timeDifference.isNull(), 0).when(df.timeDifference > 2000, 0).otherwise(df.timeDifference))
    oldColumns = df1.schema.names
    newColumns = ["userid", "platform", "previousTime", "currentTime", "timeDifference"]
    df1 = reduce(lambda df1, idx: df1.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), df1)
    df1.show()
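    # note: the rename loop above could equivalently be written as
    #   df1 = df1.toDF("userid", "platform", "previousTime", "currentTime", "timeDifference")
    # since DataFrame.toDF accepts the new column names positionally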
    
    # first part of result i.e. records where timeDifference = 0
    df_final_part0 = df1.where("timeDifference = 0")
    
    # identify records where sum(timeDifference) < 2000
    df2 = df1.where("timeDifference <> 0")
    df3 = df2.groupby("userid","platform").agg(func.sum("timeDifference")).\
        withColumnRenamed("sum(timeDifference)", "sum_timeDifference").where("sum_timeDifference < 2000")
    
    # second part of result i.e. records where sum(timeDifference) is >= 2000
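    # a left anti join keeps the df2 rows whose (userid, platform) has no match in df3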
    df_final_part1 = df2.join(df3, ["userid","platform"],"leftanti")
    
    # third part of result
    df_final_part2 = df2.join(df3,on=['userid','platform']).select('userid','platform',"previousTime","sum_timeDifference").\
        groupBy('userid','platform',"sum_timeDifference").agg(func.min("previousTime")).\
        withColumnRenamed("min(previousTime)", "previousTime").withColumnRenamed("sum_timeDifference", "timeDifference")
    def processdate(x, time_in_sec):
        x = datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')
        x += timedelta(milliseconds= time_in_sec * 1e3)
        return x.strftime('%Y-%m-%d %H:%M:%S.%f')
    f1 = func.udf(processdate,StringType())
    df_final_part2 = df_final_part2.withColumn("currentTime",f1(df_final_part2.previousTime,df_final_part2.timeDifference)).\
        select('userid','platform',"previousTime","currentTime","timeDifference")
    
    # combine all three parts to get the final result
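    # unionAll is positional, so each part must select the columns in the same order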
    result = df_final_part0.unionAll(df_final_part1).unionAll(df_final_part2)
    result.show()
    

    Don't forget to let us know whether it solved your problem :)
