Question
I found from JIRA that the 1.6 release of SparkR has implemented window functions including lag and rank, but the over function is not implemented yet. How can I use a window function like lag without over in SparkR (not the SparkSQL way)? Can someone provide an example?
Answer
Spark 2.0.0+
SparkR provides DSL wrappers with over, window.partitionBy / partitionBy, window.orderBy / orderBy and rowsBetween / rangeBetween functions.
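With those wrappers, the query from the question can be expressed directly in the DSL. A minimal sketch, assuming SparkR >= 2.0.0 with an active Spark session, and using the windowPartitionBy / orderBy / over functions as exported by SparkR 2.x (the data frame here mirrors the example below):

```r
library(SparkR)

# Assumes sparkR.session() has already been called.
set.seed(1)
sdf <- createDataFrame(data.frame(x = 1:12, y = 1:3, z = rnorm(12)))

# Window specification: PARTITION BY y ORDER BY x
w <- orderBy(windowPartitionBy("y"), "x")

# Equivalent of LAG(z) OVER (PARTITION BY y ORDER BY x), no raw SQL needed
head(select(sdf, sdf$x, sdf$y, sdf$z, over(lag(sdf$z), w)))
```

The over function ties the lag column expression to the window specification, which is exactly the piece that was missing in 1.6.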
Spark 1.6
Unfortunately it is not possible in 1.6.0. While some window functions, including lag, have been implemented, SparkR doesn't support window definitions yet, which renders these completely useless.
As long as SPARK-11395 is not resolved, the only option is to use raw SQL:
library(magrittr)  # provides the %>% pipe used below

set.seed(1)
hc <- sparkRHive.init(sc)
sdf <- createDataFrame(hc, data.frame(x=1:12, y=1:3, z=rnorm(12)))
registerTempTable(sdf, "sdf")

sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) FROM sdf") %>%
    head()
## x y z _c3
## 1 1 1 -0.6264538 NA
## 2 4 1 1.5952808 -0.6264538
## 3 7 1 0.4874291 1.5952808
## 4 10 1 -0.3053884 0.4874291
## 5 2 2 0.1836433 NA
## 6 5 2 0.3295078 0.1836433
Assuming that the corresponding PR will be merged without significant changes, the window definition and example query should look as follows:
w <- Window.partitionBy("y") %>% orderBy("x")
select(sdf, over(lag(sdf$z), w))