I'm trying to transform the following R data.frame:
structure(list( Time=c("09:30:01" ,"09:30:29" ,"09:35:56", "09:37:17" ,"09:37:21" ,"09:37:28" ,"09:37:35" ,"09:37:51" ,"09:42:11" ,"10:00:31"),
Price=c(1,2,3,4,5,6,7,8,9,10),
Volume=c(100,200,300,100,200,300,100,200,600,100)),
.Names = c("Time", "Price", "Volume"),
row.names = c(NA,10L),
class = "data.frame")
       Time Price Volume
1  09:30:01     1    100
2  09:30:29     2    200
3  09:35:56     3    300
4  09:37:17     4    100
5  09:37:21     5    200
6  09:37:28     6    300
7  09:37:35     7    100
8  09:37:51     8    200
9  09:42:11     9    600
10 10:00:31    10    100
into this:
       Time Price Volume Bin
1  09:30:01     1    100   1
2  09:30:29     2    200   1
3  09:35:56     3    200   1
4  09:35:56     3    100   2
5  09:37:17     4    100   2
6  09:37:21     5    200   2
7  09:37:28     6    100   2
8  09:37:28     6    200   3
9  09:37:35     7    100   3
10 09:37:51     8    200   3
11 09:42:11     9    500   4
12 09:42:11     9    100   5
13 10:00:31    10    100   5
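Essentially, it computes the cumulative sum of Volume and starts a new bin every time the total passes a multiple of 500. So bin 1 is 100 + 200 + 200: the Volume of 300 at 09:35:56 is split into 200/100, a new row is inserted for the remainder, and the bin counter is incremented.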
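As a rough illustration of the rule just described (not a full solution), the sketch below uses only the Volume column from the example to find which rows straddle a 500 boundary and therefore need splitting. The names cum_end, cum_start, first_bin and last_bin are made up for this illustration.

```r
# Volume column from the example data
volume <- c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100)

# Running total after each row, and just before each row
cum_end   <- cumsum(volume)
cum_start <- cum_end - volume

# Bin containing the first and the last unit of each row (500 units per bin)
first_bin <- cum_start %/% 500 + 1
last_bin  <- (cum_end - 1) %/% 500 + 1

# A row must be split when its first and last unit land in different bins
needs_split <- first_bin != last_bin
which(needs_split)
```

With this data, rows 3, 6 and 9 are flagged, which are exactly the rows that appear twice in the desired output.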
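This is relatively simple in base R, but I was wondering whether dplyr offers a more elegant and, hopefully, faster way.
Cheers
Update:
Thanks @Frank and @AntoniosK.
To answer your questions: the Volume values are all positive integers, ranging from 1 to 10k.
I microbenchmarked both approaches. On a dataset similar to the one above with about 200k rows, dplyr was slightly faster, but not by much.
Many thanks for your quick responses and help.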
Best answer
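I don't know whether this is the best or the fastest way, but it seems to be quite fast for those Volume values. The idea is simple. For each row, you create as many rows as its Volume value, each with the same Time and Price and with Volume = 1. Then you let cumsum add up the numbers and flag the point where each new batch of 500 begins. Those flags are then used to create the Bin values.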
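To make the expand-and-flag idea concrete before the dplyr version, here is a minimal base R sketch on the Volume column alone. It is only an illustration under the assumption that rows are processed in their original order; row_id, unit_pos, runs and the other names are hypothetical and not part of the answer's code.

```r
# Volume column from the example data
volume <- c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100)

# One entry per unit of Volume, tagged with the row it came from
row_id <- rep(seq_along(volume), times = volume)

# Position of each unit in the running total: 1, 2, ..., sum(volume)
unit_pos <- seq_along(row_id)

# Flag the first unit of every new batch (501, 1001, ...), then count flags
flag_500 <- as.integer(unit_pos %in% seq(501, sum(volume), by = 500))
bin <- cumsum(flag_500) + 1

# Consecutive units sharing the same (row, bin) pair collapse back into one row
runs <- rle(paste(row_id, bin))
split_volume <- runs$lengths
split_bin    <- bin[cumsum(runs$lengths)]

data.frame(Volume = split_volume, Bin = split_bin)
```

The resulting Volume and Bin columns should line up with the 13 rows of the desired output; the dplyr pipeline does the same thing with group_by and summarise while also carrying Time and Price along.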
structure(list( Time=c("09:30:01" ,"09:30:29" ,"09:35:56", "09:37:17" ,"09:37:21" ,"09:37:28" ,"09:37:35" ,"09:37:51" ,"09:42:11" ,"10:00:31"),
Price=c(1,2,3,4,5,6,7,8,9,10),
Volume=c(100,200,300,100,200,300,100,200,600,100)),
.Names = c("Time", "Price", "Volume"),
row.names = c(NA,10L),
class = "data.frame") -> dt
library(dplyr)

dt %>%
  group_by(Time, Price) %>%                        ## for each Time and Price
  do(data.frame(Volume = rep(1, .$Volume))) %>%    ## create as many rows, with Volume = 1, as the value of Volume
  ungroup() %>%                                    ## forget about the grouping
  mutate(CumSum = cumsum(Volume),                  ## cumulative sums
         flag_500 = ifelse(CumSum %in% seq(501, sum(dt$Volume), by = 500), 1, 0),  ## flag 500 batches (at 501, 1001, etc.)
         Bin = cumsum(flag_500) + 1) %>%           ## create Bin values
  group_by(Bin, Time, Price) %>%                   ## for each Bin, Time and Price
  summarise(Volume = sum(Volume)) %>%              ## get new Volume values
  select(Time, Price, Volume, Bin) %>%             ## use only if you want to re-arrange column order
  ungroup()                                        ## use if you want to forget the grouping
# Time Price Volume Bin
# (chr) (dbl) (dbl) (dbl)
# 1 09:30:01 1 100 1
# 2 09:30:29 2 200 1
# 3 09:35:56 3 200 1
# 4 09:35:56 3 100 2
# 5 09:37:17 4 100 2
# 6 09:37:21 5 200 2
# 7 09:37:28 6 100 2
# 8 09:37:28 6 200 3
# 9 09:37:35 7 100 3
# 10 09:37:51 8 200 3
# 11 09:42:11 9 500 4
# 12 09:42:11 9 100 5
# 13 10:00:31 10 100 5