问题描述
我的数据包含时间变量和选择的品牌变量,如下所示.time 表示购物时间,choosebrand 表示当时购买的品牌.
使用此数据,我想创建排名变量,如第三列、第四列等所示.
品牌排名(例如,品牌 1 - 品牌 3)应基于过去 36 小时.因此,要计算商店时间为 "2013-09-01 08:54:00 UTC"
的第二行的排名,排名应基于所有 chosenbrand
时间前 36 小时内的值.(第二排的brand1
不应在36小时内)
因此,rank_brand1、rank_brand2、rank_brand3、rank_bran4,,,是我想要的变量.
如果我想创建 rank_brand5、rank_brand6 以及...
有什么简单的方法吗?
另外,如果我想一个人做(如果每个客户都有好几个购买记录),怎么做?
数据如下,
shoptime selectedbrand rank_brand1 rank_brand2 rank_brand3, ...2013-09-01 08:35:00 UTC 品牌1 NA NA NA2013-09-01 08:54:00 UTC 品牌1 1 NA NA2013-09-01 09:07:00 UTC 品牌2 1 2 NA2013-09-01 09:08:00 UTC 品牌3 1 2 32013-09-01 09:11:00 UTC 品牌5 1 2 32013-09-01 09:14:00 UTC 品牌2 1 2 32013-09-01 09:26:00 UTC 品牌6 1 1 32013-09-01 09:26:00 UTC 品牌2 1 1 32013-09-01 09:29:00 UTC 品牌2 2 1 32013-09-01 09:32:00 UTC 品牌4 2 1 3
这里是数据的代码
dat
这是一个棘手的问题.下面的解决方案使用 non-equi joins 以 36 小时为单位聚合,dcast()
将长格式重塑为宽格式,然后与原始 进行第二次连接数据
.可以有任意数量的品牌.
library(data.table)图书馆(润滑)setDT(dat)[, shoptime := as_datetime(shoptime)]setorder(dat, shoptime) # 不需要,只是为了方便观察者dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime
shoptime 品牌1 品牌2 品牌3 品牌5 品牌6 品牌1: 2013-09-01 08:35:00 NA NA NA NA NA 品牌12:2013-09-01 08:54:00 1 NA NA NA NA 品牌13:2013-09-01 09:07:00 1 NA NA NA NA 品牌24: 2013-09-01 09:08:00 1 2 NA NA NA 品牌35: 2013-09-01 09:11:00 1 2 2 NA NA 品牌56: 2013-09-01 09:14:00 1 2 2 2 北美品牌27: 2013-09-01 09:26:00 1 1 2 2 北美品牌68: 2013-09-01 09:26:00 1 1 2 2 北美品牌29: 2013-09-01 09:29:00 2 1 3 3 3 品牌210: 2013-09-01 09:32:00 2 1 3 3 3 品牌4
说明
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime
返回每 36 小时的汇总结果:
shoptime shoptime 品牌N等级1: 2013-08-30 20:54:00 2013-09-01 08:54:00 品牌1 1 12: 2013-08-30 21:07:00 2013-09-01 09:07:00 品牌1 2 13: 2013-08-30 21:08:00 2013-09-01 09:08:00 品牌1 2 14: 2013-08-30 21:08:00 2013-09-01 09:08:00 品牌2 1 25: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌1 2 16: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌2 1 27: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌3 1 28: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌1 2 19: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌2 1 210: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌3 1 211: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌5 1 212: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌1 2 113: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌2 2 114: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌3 1 215: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌5 1 216: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌1 2 117: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌2 2 118: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌3 1 219: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌5 1 220: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌1 2 221: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌2 3 122: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌3 1 323: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌5 1 324: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌6 1 325: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌1 2 226: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌2 4 127: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌3 1 328: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌5 1 329: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌6 1 3shoptime shoptime 品牌 N 级
然后,将这个中间结果从长格式改成宽格式:
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime
shoptime 品牌1 品牌2 品牌3 品牌5 品牌61: 2013-09-01 08:54:00 1 NA NA NA NA2: 2013-09-01 09:07:00 1 NA NA NA NA3: 2013-09-01 09:08:00 1 2 不适用 不适用4: 2013-09-01 09:11:00 1 2 2 不适用 不适用5: 2013-09-01 09:14:00 1 2 2 2 不适用6: 2013-09-01 09:26:00 1 1 2 2 不适用7: 2013-09-01 09:29:00 2 1 3 3 38: 2013-09-01 09:32:00 2 1 3 3 3
与原始 dat
数据框的最终右连接完成缺失的行和列(参见上面的代码和结果).
数据
dat
My data contain time variable and chosen brand variable as below. time indicates the shopping time and chosenbrand indicates the purchased brand at the time.
With this data, I would like to create rank variable as shown third column, fourth column, and so on.
The rank of brands (e.g., brand1 - brand3) should be based on past 36 hours. So, to calculate the rank for the second row, which has shoptime as "2013-09-01 08:54:00 UTC"
the rank should be based on all chosenbrand
values within 36 hours before the time. (brand1
in second row should not be in the 36 hours)
Therefore, rank_brand1, rank_brand2, rank_brand3, rank_bran4,,, are my desired variables.
If I want to create rank_brand5, rank_brand6 as well...
Is there any simple way?
In addition, if I want to do it by individual (if each customer has several purchased history), how to do that?
Data is as below,
shoptime chosenbrand rank_brand1 rank_brand2 rank_brand3, ...
2013-09-01 08:35:00 UTC brand1 NA NA NA
2013-09-01 08:54:00 UTC brand1 1 NA NA
2013-09-01 09:07:00 UTC brand2 1 2 NA
2013-09-01 09:08:00 UTC brand3 1 2 3
2013-09-01 09:11:00 UTC brand5 1 2 3
2013-09-01 09:14:00 UTC brand2 1 2 3
2013-09-01 09:26:00 UTC brand6 1 1 3
2013-09-01 09:26:00 UTC brand2 1 1 3
2013-09-01 09:29:00 UTC brand2 2 1 3
2013-09-01 09:32:00 UTC brand4 2 1 3
Here is code for data
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
rank_brand1 = NA,
rank_brand2 = NA,
rank_brand3 = NA,
stringsAsFactors = FALSE)
This is a tricky one. The solution below uses non-equi joins to aggregate by 36 hours periods, dcast()
to reshape from long to wide format, and a second join with the original dat
. There can be an arbitrary number of brands.
library(data.table)
library(lubridate)
setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
, dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
dat, on = "shoptime"]
Explanation
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]
returns the aggregated results per 36 hours periods:
Then, this intermediate result is reshaped from long to wide format:
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
, dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]
The final right join with the original dat
data frame completes the missing rows and columns (see code and result above).
Data
dat <- data.frame(
shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)
这篇关于如何在特定条件下创建等级变量?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!