在大数据集(约100万个案例)中,每个案例都有一个“已创建”和“已审查”的dateTime
。我想计算在创建每个案例时打开的其他案例的数量。在“创建”和“审查”的dataTimes
之间打开案例。
几种解决方案在小型数据集(小于100,000个案例)上效果很好,但是计算时间却呈指数增长。我的估计是计算时间随着函数3n ^ 2的增加而增加。在n = 100,000情况下,在具有6 * 4GHz内核和64GB RAM的服务器上,计算时间大于20分钟。即使使用多核库,充其量也只能将时间减少8或10倍。不足以处理大约100万个案例。
我正在寻找一种更有效的方法来进行此计算。下面,我提供了一个函数,该函数使您可以使用dateTime
和dplyr
库轻松创建大量“已创建”和“已审查”的data.table
对,以及迄今为止尝试的两种解决方案。为了简单起见,将定时报告给用户。您只需更改顶部的“CASE_COUNT”变量即可重新执行并再次查看时间,并轻松比较您可能需要建议的其他解决方案的时间。
我将使用其他解决方案来更新原始帖子,以适当感谢他们的作者。在此先感谢您的帮助!
# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);
# Set seed for reproducibility
set.seed(123)
# Set number of cases & date range variables
CASE_COUNT <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="UTC", origin="1970-01-01");
RANGE_END <- as.POSIXct("2012-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="UTC", origin="1970-01-01");
# Select which solutions you want to run in this test
RUN_SOLUTION_1 <- TRUE; # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE; # data.table:foverlaps()
RUN_SOLUTION_3 <- TRUE; # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE; # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE; # data.table::frank()
# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)
generate_cases_table <- function(n = CASE_COUNT, start_val=RANGE_START, end_val=RANGE_END) {
# Measure duration between start_val & end_val
duration <- as.numeric(difftime(end_val, start_val, unit="secs"));
# Select random values in duration to create start_offset
start_offset <- runif(n, 0, duration);
# Calculate the creation time list
created_list <- start_offset + start_val;
# Calculate acceptable time range for censored values
# since they must always be after their respective creation value
censored_range <- as.numeric(difftime(RANGE_END, created_list, unit="secs"));
# Select random values in duration to create end_offset
creation_to_censored_times <- runif(n, 0, censored_range);
censored_list <- created_list + creation_to_censored_times;
# Create and return a data.table with creation & censor values
# calculated from start or end with random offsets
return_table <- data.table(id = 1:n,
created = created_list,
censored = censored_list);
return(return_table);
}
# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();
solution_1_function <- function (cases_table) {
# SOLUTION 1: Using dplyr::summarize:
# Group by id to set parameters for summarize() function
cases_table_grouped <- group_by(cases_table, id);
# Count the instances where other cases were created before
# and censored after each case using vectorized sum() within summarize()
cases_table_summary <- summarize(cases_table_grouped,
open_cases_at_creation = sum((cases_table$created < created &
cases_table$censored > created)));
solution_1_table <<- as.data.table(cases_table_summary, key="id");
} # End solution_1_function
solution_2_function <- function (cases_table) {
# SOLUTION 2: Using data.table::foverlaps:
# Adapted from solution provided by @Davidarenburg
# (https://stackoverflow.com/users/3001626/david-arenburg)
# The foverlaps() solution tends to crash R with large case counts
# I suspect it has to do with memory assignment of the very large objects
# It maxes RAM on my system (64GB) before crashing, possibly attempting
# to write beyond its assigned memory limits.
# I'll submit a reproduceable bug to the data.table team since
# foverlaps() is pretty new and known to be occasionally unstable
if (CASE_COUNT > 50000) {
stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
}
setDT(cases_table)[, created_dupe := created];
setkey(cases_table, created, censored);
foverlaps_table <- foverlaps(cases_table[,c("id","created","created_dupe"), with=FALSE],
cases_table[,c("id","created","censored"), with=FALSE],
by.x=c("created","created_dupe"))[order(i.id),.N-1,by=i.id];
foverlaps_table <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);
solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function
solution_3_function <- function (cases_table) {
# SOLUTION 3: Using data.table aggregation instead of dplyr::summarize
# Idea suggested by @jangorecki
# (https://stackoverflow.com/users/2490497/jangorecki)
# Count the instances where other cases were created before
# and censored after each case using vectorized sum() with data.table aggregation
cases_table_aggregated <- cases_table[order(id), sum((cases_table$created < created &
cases_table$censored > created)),by=id];
solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated, open_cases_at_creation=V1), key="id");
} # End solution_3_function
solution_4_function <- function (cases_table) {
# SOLUTION 4: Using IRanges package
# Adapted from solution suggested by @alexis_laz
# (https://stackoverflow.com/users/2414948/alexis-laz)
# The IRanges package generates ranges efficiently, intended for genome sequencing
# but working perfectly well on this data, since POSIXct values are numeric-representable
solution_4_table <<- data.table(id = cases_table$id,
open_cases_at_creation = countOverlaps(IRanges(cases_table$created,
cases_table$created),
IRanges(cases_table$created,
cases_table$censored))-1, key="id");
} # End solution_4_function
solution_5_function <- function (cases_table) {
# SOLUTION 5: Using data.table::frank()
# Adapted from solution suggested by @danas.zuokas
# (https://stackoverflow.com/users/1249481/danas-zuokas)
n <- CASE_COUNT;
# For every case compute the number of other cases
# with `created` less than `created` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]), ties.method = 'first')[1:n];
# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]), ties.method = 'first')[1:n];
solution_5_table <<- data.table(id = cases_table$id,
open_cases_at_creation = r1 - r2, key="id");
} # End solution_5_function;
# Execute user specified functions;
if (RUN_SOLUTION_1)
solution_1_timing <- system.time(solution_1_function(cases_table));
if (RUN_SOLUTION_2) {
solution_2_timing <- try(system.time(solution_2_function(cases_table)));
cases_table <- select(cases_table, -created_dupe);
}
if (RUN_SOLUTION_3)
solution_3_timing <- system.time(solution_3_function(cases_table));
if (RUN_SOLUTION_4)
solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
solution_5_timing <- system.time(solution_5_function(cases_table));
# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing)!="try-error") {
same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing)!="try-error") {
same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing)!="try-error") {
same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing)!="try-error") {
same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}
same_check <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
same_check2_3, same_check2_4, same_check2_5, same_check3_4,
same_check3_5, same_check4_5);
# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
"It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && class(solution_2_timing)!="try-error")
cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
cat("The IRanges solution solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
cat("The data.table:frank() solution solution took", solution_5_timing[3], "seconds.\n\n");
data.table::foverlaps()
解决方案在较少的情况下(dplyr::summarize()解决方案在更多情况下(> 5,000左右)更快。远远超过100,000,这两种解决方案都不可行,因为它们都太慢了。编辑:添加了基于@jangorecki提出的想法的第三种解决方案,该解决方案使用
data.table
聚合而不是dplyr::summarize()
,否则类似于dplyr
解决方案。对于大约50,000个案例,这是最快的解决方案。超过50,000个案例,dplyr::summarize()
解决方案的速度稍快一些,但幅度不大。可悲的是,对于一百万个案例,这仍然不切实际。EDIT2:添加了第四个解决方案,该解决方案改编自@alexis_laz建议的解决方案,该解决方案使用
IRanges
包及其countOverlaps
函数。它比其他3个解决方案快得多。 50,000个案例比解决方案1和3快了近400%。
编辑3:修改了案例生成功能,以适本地行使“审查”条件。感谢@jangorecki catch 了以前版本的限制。
EDIT4:重写以允许用户选择要执行的解决方案,并在每次执行之前使用
system.time()
与垃圾回收进行时序比较以获得更准确的时序(根据@jangorecki的敏锐观察)-还添加了一些崩溃情况检查条件。EDIT5:添加了第五个解决方案,改编自@ danas.zuokas使用
rank()
建议的解决方案。我的实验表明,它总是比其他解决方案至少慢一个数量级。在10,000种情况下,所需的时间为44秒,而dplyr::summarize
需要3.5秒,而IRanges
解决方案需要0.36秒。最终编辑:我对@ danas.zuokas建议的解决方案5进行了一些修改,并使@Khashaa对类型的观察与之匹配。我已经在
as.numeric
生成函数中设置了dataTime
类型,由于它在rank
或integers
而不是doubles
对象上运行,因此可以大大加快dateTime
的速度(也可以提高其他功能的速度,但不那么快)。通过一些测试,设置ties.method='first'
会产生与意图一致的结果。 data.table::frank
比base::rank
和IRanges::rank
都快。 bit64::rank
是最快的,但是它似乎与data.table::frank
的处理方式不同,我无法按需处理它。加载bit64
后,它将屏蔽大量类型和函数,并在此过程中更改data.table::frank
的结果。具体原因超出了此问题的范围。POST END NOTE:事实证明
data.table::frank
有效地处理了POSIXct
dateTimes
,而base::rank
和IRanges::rank
似乎都没有。这样,即使as.numeric
甚至不需要as.integer
(或data.table::frank
)类型设置,转换也不会损失精度,因此ties.method
差异更少。感谢所有贡献者!我学到了很多!非常感激! :)
信用将包含在我的源代码中。
ENDNOTE:这个问题是More efficient method for counting open cases as of creation time of each case的精炼和澄清版本,具有更易于使用和易读的示例代码-我在这里将其分开是为了不使原始文章过多编辑而使您不知所措,并简化了大量
dataTime
的创建对在示例代码中。这样,您就不必费劲地回答。再次感谢! 最佳答案
答案将根据问题作者对评论的意见进行更新。
我建议使用等级的解决方案。如a follow up to this question中那样创建表,或者在本问题中使用dateTime
对生成函数来创建表。两者都应该起作用。
n <- cases_table[, .N]
# For every case compute the number of other cases
# with `created` less than `creation` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
ties.method = 'first')[1:n]
# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
ties.method = 'first')[1:n]
取差异
r1 - r2
(ties.method ='first'不需要-1)得出结果(消除created
的等级)。就效率而言,只需要在cases_table
中找到具有该行数的长度的向量的秩即可。 data.table::frank
处理POSIXct
dateTime
对象的速度与numeric
对象一样快(与base::rank
不同),因此不需要类型转换。关于r - 在大数据集中提交每个案件时计算未结案件的有效方法,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/34245295/