This article covers how to use window functions in SparkR, and may serve as a useful reference for anyone facing the same problem.

Problem Description

I found from JIRA that the 1.6 release of SparkR has implemented window functions including lag and rank, but the over function is not implemented yet. How can I use a window function like lag without over in SparkR (not the SparkSQL way)? Can someone provide an example?

Recommended Answer

Spark 2.0.0+

SparkR provides DSL wrappers with the over, window.partitionBy / partitionBy, window.orderBy / orderBy and rowsBetween / rangeBetween functions.
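As a minimal sketch of the 2.0.0+ DSL, assuming a running Spark session and a SparkDataFrame sdf with columns x, y and z like the one built in the SQL example further down (windowPartitionBy and orderBy on a WindowSpec are the concrete SparkR 2.x names behind these wrappers):

```r
# Sketch, assuming SparkR 2.0.0+ and an sdf with columns x, y, z.
# Build a window spec: PARTITION BY y ORDER BY x
ws <- orderBy(windowPartitionBy("y"), "x")

# Apply lag over that window, equivalent to the raw-SQL query below
head(select(sdf, sdf$x, sdf$y, sdf$z, over(lag(sdf$z), ws)))
```

This requires a live Spark cluster, so treat it as an illustration of the API shape rather than a standalone script.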

Spark 1.6.0

Unfortunately it is not possible in 1.6.0. While some window functions, including lag, have been implemented, SparkR doesn't support window definitions yet, which renders them completely useless.

As long as SPARK-11395 is not resolved, the only option is to use raw SQL:

library(magrittr)  # for the %>% pipe
set.seed(1)

# Initialize a HiveContext and register the example data as a temp table
hc  <- sparkRHive.init(sc)
sdf <- createDataFrame(hc, data.frame(x = 1:12, y = 1:3, z = rnorm(12)))
registerTempTable(sdf, "sdf")

# Define the window directly in SQL: LAG(z) partitioned by y, ordered by x
sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) FROM sdf") %>%
  head()

##    x y          z        _c3
## 1  1 1 -0.6264538         NA
## 2  4 1  1.5952808 -0.6264538
## 3  7 1  0.4874291  1.5952808
## 4 10 1 -0.3053884  0.4874291
## 5  2 2  0.1836433         NA
## 6  5 2  0.3295078  0.1836433

Assuming the corresponding PR is merged without significant changes, the window definition and example query should look as follows:

w <- Window.partitionBy("y") %>% orderBy("x")
select(sdf, over(lag(sdf$z), w))

That concludes this article on SparkR window functions; hopefully the recommended answer above is helpful.
