I am trying to convert the following R data.frame:

    structure(list( Time=c("09:30:01"  ,"09:30:29"  ,"09:35:56",  "09:37:17"  ,"09:37:21"  ,"09:37:28"  ,"09:37:35"  ,"09:37:51"  ,"09:42:11"  ,"10:00:31"),
            Price=c(1,2,3,4,5,6,7,8,9,10),
            Volume=c(100,200,300,100,200,300,100,200,600,100)),
      .Names = c("Time", "Price", "Volume"),
      row.names = c(NA,10L),
      class = "data.frame")

           Time Price Volume
    1  09:30:01     1    100
    2  09:30:29     2    200
    3  09:35:56     3    300
    4  09:37:17     4    100
    5  09:37:21     5    200
    6  09:37:28     6    300
    7  09:37:35     7    100
    8  09:37:51     8    200
    9  09:42:11     9    600
    10 10:00:31    10    100

into this:

           Time Price Volume Bin
    1  09:30:01     1    100   1
    2  09:30:29     2    200   1
    3  09:35:56     3    200   1
    4  09:35:56     3    100   2
    5  09:37:17     4    100   2
    6  09:37:21     5    200   2
    7  09:37:28     6    100   2
    8  09:37:28     6    200   3
    9  09:37:35     7    100   3
    10 09:37:51     8    200   3
    11 09:42:11     9    500   4
    12 09:42:11     9    100   5
    13 10:00:31    10    100   5

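Essentially, it computes the cumulative sum of Volume and starts a new bin every time that sum crosses a multiple of 500. So bin 1 is 100 + 200 + 200: the Volume at 09:35:56 is split into 200/100, a new row is inserted, and the bin counter is incremented.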
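The splitting rule described above can be sketched in base R without looping over rows. This is only one possible reading of the rule, assuming every bin should hold exactly 500 units (the last one possibly fewer) and that all Volume values are positive; the helper name `bin_by_volume` and its `size` argument are made up for illustration and are not from the original post.

```r
# Example data from the question.
dt <- data.frame(
  Time   = c("09:30:01", "09:30:29", "09:35:56", "09:37:17", "09:37:21",
             "09:37:28", "09:37:35", "09:37:51", "09:42:11", "10:00:31"),
  Price  = 1:10,
  Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split rows wherever the running Volume total
# crosses a multiple of `size`, and label each piece with its bin.
bin_by_volume <- function(df, size = 500) {
  end  <- cumsum(df$Volume)                        # running total after each row
  cuts <- size * seq_len(floor(max(end) / size))   # bin boundaries: 500, 1000, ...
  pts  <- sort(unique(c(0, end, cuts)))            # every point where a piece starts or ends
  lo   <- head(pts, -1)                            # start of each piece
  hi   <- pts[-1]                                  # end of each piece
  row  <- findInterval(lo, end) + 1                # source row that contains each piece
  data.frame(
    Time   = df$Time[row],
    Price  = df$Price[row],
    Volume = hi - lo,                              # size of the piece
    Bin    = lo %/% size + 1,                      # bin the piece falls into
    stringsAsFactors = FALSE
  )
}

binned <- bin_by_volume(dt)
print(binned)
```

On the example data this yields 13 rows, with the 09:35:56, 09:37:28 and 09:42:11 rows each split in two. Because it works on cumulative boundaries rather than expanding rows, its cost depends on the number of rows and bins, not on the size of the Volume values.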

This is relatively straightforward in base R, but I was wondering whether dplyr offers a more elegant and, hopefully, faster way.

Cheers

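Update:

Thanks @Frank and @AntoniosK.

To answer your question, the Volume values are all positive integers ranging from 1 to 10k.

I microbenchmarked both approaches. On a dataset of about 200k rows similar to the one above, dplyr was slightly faster, but not by much.

Many thanks for the quick responses and the help.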

Best answer

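I don't know whether this is the best or the fastest approach, but it seems fast for those Volume values. The idea is simple. For each row, you create as many rows as its Volume value, each with the same Time and Price and with Volume = 1. Then you let cumsum add up the numbers and flag each point where a new batch of 500 begins. Use those flags to create the Bin values.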

    dt <- structure(list( Time=c("09:30:01"  ,"09:30:29"  ,"09:35:56",  "09:37:17"  ,"09:37:21"  ,"09:37:28"  ,"09:37:35"  ,"09:37:51"  ,"09:42:11"  ,"10:00:31"),
                    Price=c(1,2,3,4,5,6,7,8,9,10),
                    Volume=c(100,200,300,100,200,300,100,200,600,100)),
              .Names = c("Time", "Price", "Volume"),
              row.names = c(NA,10L),
              class = "data.frame")

    library(dplyr)

    dt %>%
      group_by(Time, Price) %>%                     ## for each Time and Price
      do(data.frame(Volume = rep(1,.$Volume))) %>%  ## create as many rows, with Volume = 1, as the value of Volume
      ungroup() %>%                                 ## forget about the grouping
      mutate(CumSum = cumsum(Volume),               ## cumulative sums
             flag_500 = ifelse(CumSum %in% seq(501,sum(dt$Volume),by=500),1,0),  ## flag 500 batches (at 501, 1001, etc.)
             Bin = cumsum(flag_500)+1) %>%          ## create Bin values
      group_by(Bin, Time, Price) %>%                ## for each Bin, Time and Price
      summarise(Volume = sum(Volume)) %>%           ## get new Volume values
      select(Time, Price, Volume, Bin) %>%          ## use only if you want to re-arrange column order
      ungroup()                                     ## use if you want to forget the grouping

    #        Time Price Volume   Bin
    #       (chr) (dbl)  (dbl) (dbl)
    # 1  09:30:01     1    100     1
    # 2  09:30:29     2    200     1
    # 3  09:35:56     3    200     1
    # 4  09:35:56     3    100     2
    # 5  09:37:17     4    100     2
    # 6  09:37:21     5    200     2
    # 7  09:37:28     6    100     2
    # 8  09:37:28     6    200     3
    # 9  09:37:35     7    100     3
    # 10 09:37:51     8    200     3
    # 11 09:42:11     9    500     4
    # 12 09:42:11     9    100     5
    # 13 10:00:31    10    100     5
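
For comparison, the same unit-expansion idea can be sketched in base R. This is not the answer's code, just an illustration of the technique: `rep()` expands each row into one entry per unit of Volume, integer division assigns every unit to a bin of 500, and `rle()` collapses the units back into rows. The variable names (`row_id`, `bin`, `runs`, `result`) are assumptions made for this sketch.

```r
# Example data from the question.
dt <- data.frame(
  Time   = c("09:30:01", "09:30:29", "09:35:56", "09:37:17", "09:37:21",
             "09:37:28", "09:37:35", "09:37:51", "09:42:11", "10:00:31"),
  Price  = 1:10,
  Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100),
  stringsAsFactors = FALSE
)

size <- 500

# One entry per unit of Volume, holding the index of the row it came from.
row_id <- rep(seq_len(nrow(dt)), times = dt$Volume)

# Units 1..500 go to bin 1, units 501..1000 to bin 2, and so on.
bin <- (seq_along(row_id) - 1) %/% size + 1

# Consecutive units that share both source row and bin form one output row.
runs <- rle(paste(row_id, bin))
last <- cumsum(runs$lengths)          # position of the last unit in each run

result <- data.frame(
  Time   = dt$Time[row_id[last]],
  Price  = dt$Price[row_id[last]],
  Volume = runs$lengths,              # number of units in the run
  Bin    = bin[last],
  stringsAsFactors = FALSE
)
print(result)
```

As with the dplyr version, the expanded vectors have one element per unit of Volume, so memory use grows with the total Volume rather than with the number of rows.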
