本文介绍了数据表中的条件差计算的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述 我有一个百万行长的data.table约20个反式样列。这些列显示各种存储系统操作的计数器在时间上增加。有时候,计数器在被监视的系统上被复位,单个观测值低于上一个。 我需要计算一个 opsdiff 列,其将包含基于列 type 的相同类型的后续值的算术差(预期为正)。当识别计数器复位情况时 - 即。差值为负,应使用计数器的实际值。 > dt & dt type opscounter 1:a 105609 2:a 106316 3:a 106705 4:a 489 5:a 723 6:a 1250 7:b 105609 8:b 106316 9:b 106705 10:b 489 11:b 723 12 :b 1250 我想得到的结果: $ b b > dt type opscounter opsdiff 1:a 105609 NA 2:a 106316 707 3:a 106705 389 4:a 489 489 5: a 723 234 6:a 1250 527 7:b 105609 NA 8:b 106316 707 9:b 106705 389 10:b 489 489 11:b 723 234 12:b 1250 527 > 我试图构建一个基于 ifelse $ c>,但我不知道如何解决当前行的 opscounter 在 else 部分中的值,也不知道 diff()可能是一个有效的用法: dt [,opsdiff:= c NA,ifelse(diff(opscounter)> 0,diff(opscounter),opscounter)),by = type] $ b b 如何正确计算 opsdiff 列? 问候, Lukas 解决方案首选解决方案是: dt [,opsdiff:= c(NA,diff(opscounter)), by = type] [opsdiff< 0,opsdiff:= opscounter] [] #type opscounter opsdiff #1:a 105609 NA #2:a 106316 707 #3:a 106705 389 #4:a 489 489 #5:a 723 234 #6:a 1250 527 #7:b 105609 NA #8:b 106316 707 #9:b 106705 389 #10:b 489 489 #11:b 723 234 #12:b 1250 527 pre> 一般来说,最好避免 ifelse (特别是在您使用这种大数据集的情况下)为可以缓慢(虽然向量化),因为它评估了是和无情况。在你的情况下,你发现另一个缺陷,你需要告诉它确切的位置,你想从 opscounter ,这增加了复杂性(见 @可以覆盖的Arwr注释)。 关于你的问题,在 DT [...] 操作的 data.table $ c>只是调用 [。data.table(DT,...)。在 data.frame 上没有任何区别;有一个类似的函数 [。data.frame 。 为了使它更清晰,在data.table中,tacks [...] 一个接一个称为链接。这是免费的。您也可以在data.frame中执行相同操作(如下所示),但是您可以对 data.frame 执行的操作是有限的,因此使用与数据表不同。 df df [df $ type ==a,] [2:3,] #type opscounter #2 a 106316 #3 a 106705 为了说明 ifelse 的无效性,这里是一个基准: set.seed(123)n< - 1e6 dt< - data.table(type = rep(c(a,b),each = n), opscounter = sample(1:1e5,n * 2,replace = TRUE)) 库(microbenchmark) microbenchmark( dt [,opsdiff:= c (NA,diff(opscounter)),by = type] [opsdiff dt [,opsdiff:= c(NA,ifelse (opscounter),tail(opscounter,-1L))),by = type] ) #单位:毫秒#expr # :=`(opsdiff,c(NA,diff(opscounter)))by = type] [opsdiff #dt [,`:=`(opsdiff,c(NA,ifelse(diff(opscounter) -1L))),by = type] #min lq mean median uq max neval #228.0445 255.4006 285.8163 281.1388 307.4195 508.3841 100 #899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704 100 ifelse 解决方案的速度慢了〜4 。 I have a millions row long data.table with about 20 counter-style columns. Those columns display various storage system operations' counters increasing in time. Sometimes however, the counter gets reset on the monitored system and the single observation value is lower than a previous one.I need to calculate a opsdiff column, which would contain an arithmetical difference (expected to be positive) of subsequent values of same type based on column type. When a counter reset situation is identified - ie. the difference is negative, the actual value of the counter should be used.> dt <- data.table(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250))> dt type opscounter 1: a 105609 2: a 106316 3: a 106705 4: a 489 5: a 723 6: a 1250 7: b 105609 8: b 106316 9: b 10670510: b 48911: b 72312: b 1250The result I wish to get:> dt type opscounter opsdiff 1: a 105609 NA 2: a 106316 707 3: a 106705 389 4: a 489 489 5: a 723 234 6: a 1250 527 7: b 105609 NA 8: b 106316 707 9: b 106705 38910: b 489 48911: b 723 23412: b 1250 527> I tried to build a construction based on ifelse() but I don't know how to address the current row's opscounter value in else part and neither the dual call of diff() is perhaps an efficient usage:dt[,opsdiff:=c(NA, ifelse(diff(opscounter)>0, diff(opscounter), opscounter)), by=type]How can I correctly calculate the opsdiff column?Greetings,Lukas 解决方案 The preferred solution would be:dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter][]# type opscounter opsdiff# 1: a 105609 NA# 2: a 106316 707# 3: a 106705 389# 4: a 489 489# 5: a 723 234# 6: a 1250 527# 7: b 105609 NA# 8: b 106316 707# 9: b 106705 389# 10: b 489 489# 11: b 723 234# 12: b 1250 527In general, it would be better to avoid ifelse (especially in your case with such big data set) as it can to be slow (although vectorized) due to it evaluating both yes and no cases. In your case, you found another "flaw", that you need to tell it the exact locations where you want to pull opscounter from, which adds to the complexity (see @Aruns comment for possible override).Regarding your question in comments, a data.table operation of the form DT[...] just calls the function [.data.table(DT, ...). It's not any different on a data.frame; there is a similar function [.data.frame. To make it even clearer, in data.table, tacking [...] one after another is called chaining. This is something that comes for free. You can do the same in a data.frame as well (as shown below), but the operations you can perform on a data.frame are limited, and therefore the use of chaining itself, unlike data.table.df <- as.data.frame(dt) # or `setDF(dt)` in 1.9.4+ to do this by referencedf[df$type == "a", ][2:3, ]# type opscounter# 2 a 106316# 3 a 106705Finally, in order to illustrate the ineffectiveness of ifelse, here is a benchmark:set.seed(123)n <- 1e6dt <- data.table(type = rep(c("a","b"), each = n), opscounter = sample(1:1e5, n*2, replace = TRUE))library(microbenchmark)microbenchmark(dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter],dt[, opsdiff := c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L))), by=type])# Unit: milliseconds# expr# dt[, `:=`(opsdiff, c(NA, diff(opscounter))), by = type][opsdiff < 0, `:=`(opsdiff, opscounter)]# dt[, `:=`(opsdiff, c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L)))), by = type]# min lq mean median uq max neval# 228.0445 255.4006 285.8163 281.1388 307.4195 508.3841 100# 899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704 100The ifelse solution is slower by a factor of ~4. 这篇关于数据表中的条件差计算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!
10-24 21:11