问题描述
我有从2017年1月1日到2017年1月7日开始的数据,这是每周需要的每周汇总.我以下列方式使用了窗口功能
I have data which starts from 1st Jan 2017 to 7th Jan 2017 and it is a week wanted weekly aggregate. I used window function in following manner
val df_v_3 = df_v_2.groupBy(window(col("DateTime"), "7 day"))
.agg(sum("Value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
我在数据框中的数据为
DateTime,value
2017-01-01T00:00:00.000+05:30,1.2
2017-01-01T00:15:00.000+05:30,1.30
--
2017-01-07T23:30:00.000+05:30,1.43
2017-01-07T23:45:00.000+05:30,1.4
我得到的输出为:
2016-12-29T05:30:00.000+05:30,2017-01-05T05:30:00.000+05:30,723.87
2017-01-05T05:30:00.000+05:30,2017-01-12T05:30:00.000+05:30,616.74
它显示我的一天是从2016年12月29日开始,但实际数据是从2017年1月1日开始,为什么会出现这种保证金?
It shows that my day is starting from 29th Dec 2016 but in actual data is starting from 1 Jan 2017,why this margin is occuring?
推荐答案
对于像这样的滚动窗口,可以设置开始时间的偏移量,有关更多信息,请参见博客此处.使用滑动窗口,但是,通过将窗口持续时间"和滑动持续时间"设置为相同的值,它将与具有起始偏移量的滚动窗口相同.
For tumbling windows like this it is possible to set an offset to the starting time, more information can be found in the blog here. A sliding window is used, however, by setting both "window duration" and "sliding duration" to the same value, it will be the same as a tumbling window with starting offset.
语法如下,
window(column, window duration, sliding duration, starting offset)
使用您的值,我发现偏移量为64小时的起始时间为2017-01-01 00:00:00
.
With your values I found that an offset of 64 hours would give a starting time of 2017-01-01 00:00:00
.
val data = Seq(("2017-01-01 00:00:00",1.0),
("2017-01-01 00:15:00",2.0),
("2017-01-08 23:30:00",1.43))
val df = data.toDF("DateTime","value")
.withColumn("DateTime", to_timestamp($"DateTime", "yyyy-MM-dd HH:mm:ss"))
val df2 = df
.groupBy(window(col("DateTime"), "1 week", "1 week", "64 hours"))
.agg(sum("value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
将给出此结果数据框:
+-------------------+-------------------+-------------+
| start| end|aggregate_sum|
+-------------------+-------------------+-------------+
|2017-01-01 00:00:00|2017-01-08 00:00:00| 3.0|
|2017-01-08 00:00:00|2017-01-15 00:00:00| 1.43|
+-------------------+-------------------+-------------+
这篇关于在Spark中使用Windows函数进行每周汇总的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!