本文介绍了从每个案例的创作时间开始计数开放案件的更有效的方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试找到一种更有效的方式来计算每个案例的创建时间内打开的案例数。一个案例在其创建日期/时间戳和其审查日期/时间戳之间是开放的。您可以复制下面的代码来查看一个简单的功能示例:

I am trying find a more efficient way to count the number of cases that are open as of the creation time of each case. A case is "open" between its creation date/time stamp and its censor date/time stamp. You can copy-paste the code below to view a simple functional example:

# Create a bunch of date/time stamps for our example
two_thousand                <- as.POSIXct("2000-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_one            <- as.POSIXct("2001-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_two            <- as.POSIXct("2002-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_three          <- as.POSIXct("2003-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_four           <- as.POSIXct("2004-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_five           <- as.POSIXct("2005-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_six            <- as.POSIXct("2006-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_seven          <- as.POSIXct("2007-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_eight          <- as.POSIXct("2008-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_nine           <- as.POSIXct("2009-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_ten            <- as.POSIXct("2010-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_eleven         <- as.POSIXct("2011-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

mid_two_thousand            <- as.POSIXct("2000-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_one        <- as.POSIXct("2001-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_mid_two    <- as.POSIXct("2002-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_three      <- as.POSIXct("2003-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_four       <- as.POSIXct("2004-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_five       <- as.POSIXct("2005-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_six        <- as.POSIXct("2006-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_seven      <- as.POSIXct("2007-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_eight      <- as.POSIXct("2008-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_nine       <- as.POSIXct("2009-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_ten        <- as.POSIXct("2010-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_eleven     <- as.POSIXct("2011-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

# Create a table that has pairs of created & censored date/time stamps for cases, indicating the range during which each case is "open"
comparison_table    <- data.table(id        = 1:10,
                                  created   = c(two_thousand, two_thousand_two, two_thousand_four, two_thousand_six, two_thousand_eight, two_thousand_ten, two_thousand, two_thousand_six, two_thousand_three, two_thousand_one),
                                  censored  = c(two_thousand_one, two_thousand_three, two_thousand_five, two_thousand_seven, two_thousand_nine, two_thousand_eleven, two_thousand_five, two_thousand_ten, two_thousand_eight, two_thousand_four));

# Create a table that has the creation date/time stamps at which we want to count all the open cases
check_table         <- data.table(id        = 1:12,
                                  creation  = c(mid_two_thousand, mid_two_thousand_one, mid_two_thousand_mid_two, mid_two_thousand_three, mid_two_thousand_four, mid_two_thousand_five, mid_two_thousand_six, mid_two_thousand_seven, mid_two_thousand_eight, mid_two_thousand_nine, mid_two_thousand_ten, mid_two_thousand_eleven)); 

# I use the DPLYR library as the group_by() + summarize() functions make this operation simple
library(dplyr);

# Group by id to set parameter for summarize() function 
check_table_grouped <- group_by(check_table, id);

# For each id in the table, sum the number of times that its creation date/time stamp is greater than the creation date/time and less than the censor date/time of all cases in the comparison table
# EDIT: Also added timing to compare with method below
system.time(check_table_summary <- summarize(check_table_grouped, other_open_values_at_creation_count = sum((comparison_table$created < creation & comparison_table$censored > creation))));

# Result is as desired
check_table_summary;              

# EDIT: Added @David-arenburg's solution with timing
library(data.table);
setDT(check_table)[, creation2 := creation];
setkey(comparison_table, created, censored);
system.time(foverlaps_table <- foverlaps(check_table, comparison_table, by.x = c("creation", "creation2"))[, sum(!is.na(id)), by = i.id]);

# Same results as above
foverlaps_table;

此方法适用于像本例中的小数据集。然而,即使我使用矢量化的操作,计算时间也呈指数增长,因为运算计数是:(3 * nrow比较)*(nrow sum(nrow)计算)。 nrow = 10,000,时间约14s,nrow = 100,000,时间> 20分钟。我的实际需要是〜1,000,000。

This approach works fine for small data sets like the one in this example. However, even though I'm using vectorized operations, the computation time grows exponentially because the operation count is: (3 * nrow comparisons) * (nrow sum(nrow) calculations). At nrow=10,000, time is around 14s, at nrow=100,000, time is > 20 minutes. My actual nrow is ~ 1,000,000.

有更有效的方法来做这个计算吗?我目前正在研究多核选项,但即使那些只会线性地减少执行时间。您的帮助不胜感激谢谢!

Is there a more efficient way to do this calculation? I'm currently looking into multicore options, but even those will only linearly reduce the execution time. Your help is appreciated. Thanks!

编辑:添加@ David-arenburg的 data.table :: foverlaps 解决方案,这也是正常的更快的nrow<但是,对于较大数量的行,它比总结解决方案慢。 10,000行是两倍长。在五万行,我放弃了等待10倍以上。有趣的是, foverlaps 解决方案似乎没有触发自动垃圾收集,所以不断地坐在最大RAM(我的系统上为64GB),而总结解决方案定期触发自动垃圾收集,所以永远不要超过〜40GB的RAM。我不确定这是否与速度差异有关。

Added @David-arenburg's data.table::foverlaps solution, which also works and is faster for nrow < 1000. However, it is slower than the summarize solution for larger numbers of rows. At 10,000 rows it was twice as long. At 50,000 rows, I gave up waiting after 10x as long. Interestingly, the foverlaps solution doesn't seem to trigger automatic garbage collection, so constantly sits at max RAM (64GB on my system) whereas the summarize solution periodically triggers the automatic garbage collection, so never exceeds ~ 40GB of RAM. I'm unsure if this is related to the speed differences.

最终编辑:我重新编写了一个问题,使受访者更容易生成大桌子与合适的创建/审查日期时间。我还简化了问题,并更清楚地解释了这个问题,清楚的是查询表非常大(违反了 data.table :: foverlaps 假设)。我甚至建立了时序比较,使其超级简单的大案例测试!详细信息:

FINAL I have re-written the question in a way that makes it much easier for respondents to generate large tables with suitable created/censored dateTimes. I've also simplified and explained the problem more clearly, making it clear that the lookup table is very large (violating data.table::foverlaps assumptions). I've even built in timing comparison to make it super simple to test with large cases! Details here: Efficient method for counting open cases at time of each case's submission in large data set

再次感谢您的帮助!

推荐答案

另一个 foverlaps 解决方案。假设 comparison_table 不太大

Yet another foverlaps solution. Assuming that comparison_table is not too large

library(data.table);
setkey(comparison_table, created, censored);    
times <- sort(unique(c(comparison_table$created, comparison_table$censored)))
dt <- data.table(creation=times+1)[, creation2 := creation];
setkey(dt, creation, creation2)
x <- foverlaps(comparison_table, dt, by.x = c("created", "censored"))[,.N,creation]$N
check_table$newcol <- x[findInterval(check_table$creation, times)]

这篇关于从每个案例的创作时间开始计数开放案件的更有效的方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-29 17:42