
Problem description

I am using pyspark 1.5, getting my data from Hive tables and trying to use windowing functions.

According to this, there exists an analytic function called firstValue that will give me the first non-null value for a given window. I know this exists in Hive, but I cannot find it anywhere in pyspark.

Is there a way to implement this, given that pyspark won't allow UserDefinedAggregateFunctions (UDAFs)?

Answer

Spark >= 2.0:

first takes an optional ignorenulls argument which can mimic the behavior of first_value:

df.select(col("k"), first("v", True).over(w).alias("fv"))
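The ignore-nulls semantics can be illustrated with a plain-Python sketch (illustrative only, not pyspark API code): over a partition's values, ordered by the window's sort key, first("v", True) returns the first value that is not null.

```python
def first_ignore_nulls(values):
    """Return the first non-None value in an ordered sequence,
    mimicking first("v", True) over a window; None if all are null."""
    for v in values:
        if v is not None:
            return v
    return None

# Partition "a" from the sample data below, ordered by v (nulls first): [None, -1, 1]
print(first_ignore_nulls([None, -1, 1]))  # -> -1
```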

Spark < 2.0:

The available function is called first and can be used as follows:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, first

df = sc.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])

w = Window().partitionBy("k").orderBy("v")

df.select(col("k"), first("v").over(w).alias("fv"))
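Note that with the default ascending sort Spark SQL places nulls first, so for key "a" the first value in the window is null and plain first("v") returns null. A pure-Python sketch of what this query effectively computes per key (illustrative only, not the pyspark API):

```python
rows = [("a", None), ("a", 1), ("a", -1), ("b", 3)]

def first_per_key(rows):
    """Group rows by key, sort each group's values ascending with
    nulls first (mirroring Spark SQL's default ordering), and take
    the first value, as plain first("v").over(w) would."""
    groups = {}
    for k, v in rows:
        groups.setdefault(k, []).append(v)
    out = {}
    for k, vs in groups.items():
        vs.sort(key=lambda v: (v is not None, v if v is not None else 0))
        out[k] = vs[0]  # may be None, since nulls sort first
    return out

print(first_per_key(rows))  # -> {'a': None, 'b': 3}
```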

but if you want to ignore nulls you'll have to use Hive UDFs directly:

df.registerTempTable("df")

sqlContext.sql("""
    SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
    FROM df""")
