I have some data frames with the following structure:
|ID|Page |User |Timestamp |
|1|Page 1 |Ericd |2002-09-07 19:39:55|
|1|Page 1 |Liir |2002-10-12 03:01:42|
|1|Page 1 |Tubby |2002-10-12 03:02:23|
|1|Page 1 |Mojo |2002-10-12 03:18:24|
|1|Page 1 |Kirf |2002-10-12 03:19:03|
|2|Page 2 |The Epopt |2001-11-28 22:27:37|
|2|Page 2 |Conversion script|2002-02-03 01:49:16|
|2|Page 2 |Bryan Derksen |2002-02-25 16:51:15|
|2|Page 2 |Gear |2002-10-04 12:46:06|
|2|Page 2 |Tim Starling |2002-10-06 08:13:42|
|2|Page 2 |Tim Starling |2002-10-07 03:00:54|
|2|Page 2 |Salsa Shark |2003-03-18 01:45:32|
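I want to find the number of users who visited each page within a given time period (for example, per month). For October 2002, the matching rows would be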
|1|Page 1 |Liir |2002-10-12 03:01:42|
|1|Page 1 |Tubby |2002-10-12 03:02:23|
|1|Page 1 |Mojo |2002-10-12 03:18:24|
|1|Page 1 |Kirf |2002-10-12 03:19:03|
|2|Page 2 |Gear |2002-10-04 12:46:06|
|2|Page 2 |Tim Starling |2002-10-06 08:13:42|
|2|Page 2 |Tim Starling |2002-10-07 03:00:54|
and the count per page:
|ID|Page |numberOfUsers (in October 2002)|
|1|Page 1 |4|
|2|Page 2 |3|
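The question is also how to apply this logic to every month of every year. I have worked out how to count, for example, the events in the last n days: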
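from pyspark.sql import Window, functions as func
from pyspark.sql.functions import col
# The window is ordered by epoch seconds, so convert days to seconds
days = lambda i: i * 86400
window = (Window.partitionBy(col("page"))
          .orderBy(col("timestamp").cast("timestamp").cast("long"))
          .rangeBetween(-days(30), 0))
df = df.withColumn("monthly_occurrences", func.count("user").over(window))
df.show()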
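Any suggestions would be appreciated.
Best answer
You can first create a column that holds the year-month combination and then group by that column. A working example: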
import pyspark.sql.functions as F
# Assumes an existing SparkContext `sc`, as provided by the pyspark shell
df = sc.parallelize([
    ('2018-06-02T00:00:00.000Z', 'tim', 'page 1'),
    ('2018-07-20T00:00:00.000Z', 'tim', 'page 1'),
    ('2018-07-20T00:00:00.000Z', 'john', 'page 2'),
    ('2018-07-20T00:00:00.000Z', 'john', 'page 2'),
    ('2018-08-20T00:00:00.000Z', 'john', 'page 2')
]).toDF(("datetime", "user", "page"))
# Build a "year-month" key (month is not zero-padded, e.g. 2018-7)
df = df.withColumn('yearmonth', F.concat(F.year('datetime'), F.lit('-'), F.month('datetime')))
# Count rows per month and page
df_agg = df.groupBy('yearmonth', 'page').count()
df_agg.show()
Output:
+---------+------+-----+
|yearmonth| page|count|
+---------+------+-----+
| 2018-7|page 2| 2|
| 2018-6|page 1| 1|
| 2018-7|page 1| 1|
| 2018-8|page 2| 1|
+---------+------+-----+
Hope this helps!
Source: a similar question about Python window functions found on Stack Overflow: https://stackoverflow.com/questions/51236909/