Problem description
In a large data set (~1M cases), each case has a "created" and a "censored" dateTime. I want to count the number of other cases that were open at the time each case was created. Cases are open between their "created" and "censored" dateTimes.
Several solutions work well on small datasets (<100,000 cases), but computation time grows quadratically; my estimate is that it scales roughly as 3n². By n = 100,000 cases, computation time is >20 minutes on my server with 6 × 4 GHz cores & 64 GB RAM. Even with multi-core libraries, at best I could reduce the time by a factor of 8 or 10. Not enough to handle ~1M cases.
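For intuition about the quadratic growth: each case's creation time must be compared against every other case's interval, so the work scales with n². A minimal sketch of that pairwise count on hand-made toy values (not the benchmark data below):

```r
# Toy illustration of the O(n^2) pairwise count (hand-made values, not benchmark data)
created  <- c(1, 2, 4)
censored <- c(5, 3, 6)
# For each case, count the other cases already created and not yet censored
open_at_creation <- sapply(created, function(t) sum(created < t & censored > t))
open_at_creation  # 0 1 1
```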
I'm looking for a more efficient method to do this calculation. Below I have provided a function that allows you to easily create large numbers of "created" and "censored" dateTime pairs, along with the solutions tried so far, using the dplyr, data.table, and IRanges libraries. The timings are reported to the user for simplicity. You can simply change the "CASE_COUNT" variable at the top, re-execute, and view the times again to easily compare the timing of other solutions you might suggest.
I will update the original post with other solutions to give proper credit to their authors. Thanks in advance for your help with this!
# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);
# Set seed for reproducibility
set.seed(123)
# Set number of cases & date range variables
CASE_COUNT <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="UTC", origin="1970-01-01");
RANGE_END <- as.POSIXct("2012-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="UTC", origin="1970-01-01");
# Select which solutions you want to run in this test
RUN_SOLUTION_1 <- TRUE; # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE; # data.table::foverlaps()
RUN_SOLUTION_3 <- TRUE; # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE; # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE; # data.table::frank()
# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)
generate_cases_table <- function(n = CASE_COUNT, start_val=RANGE_START, end_val=RANGE_END) {
# Measure duration between start_val & end_val
duration <- as.numeric(difftime(end_val, start_val, unit="secs"));
# Select random values in duration to create start_offset
start_offset <- runif(n, 0, duration);
# Calculate the creation time list
created_list <- start_offset + start_val;
# Calculate acceptable time range for censored values
# since they must always be after their respective creation value
censored_range <- as.numeric(difftime(RANGE_END, created_list, unit="secs"));
# Select random values in duration to create end_offset
creation_to_censored_times <- runif(n, 0, censored_range);
censored_list <- created_list + creation_to_censored_times;
# Create and return a data.table with creation & censor values
# calculated from start or end with random offsets
return_table <- data.table(id = 1:n,
created = created_list,
censored = censored_list);
return(return_table);
}
# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();
solution_1_function <- function (cases_table) {
# SOLUTION 1: Using dplyr::summarize:
# Group by id to set parameters for summarize() function
cases_table_grouped <- group_by(cases_table, id);
# Count the instances where other cases were created before
# and censored after each case using vectorized sum() within summarize()
cases_table_summary <- summarize(cases_table_grouped,
open_cases_at_creation = sum((cases_table$created < created &
cases_table$censored > created)));
solution_1_table <<- as.data.table(cases_table_summary, key="id");
} # End solution_1_function
solution_2_function <- function (cases_table) {
# SOLUTION 2: Using data.table::foverlaps:
# Adapted from solution provided by @Davidarenburg
# (https://stackoverflow.com/users/3001626/david-arenburg)
# The foverlaps() solution tends to crash R with large case counts
# I suspect it has to do with memory assignment of the very large objects
# It maxes RAM on my system (64GB) before crashing, possibly attempting
# to write beyond its assigned memory limits.
# I'll submit a reproducible bug to the data.table team since
# foverlaps() is pretty new and known to be occasionally unstable
if (CASE_COUNT > 50000) {
stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
}
setDT(cases_table)[, created_dupe := created];
setkey(cases_table, created, censored);
foverlaps_table <- foverlaps(cases_table[,c("id","created","created_dupe"), with=FALSE],
cases_table[,c("id","created","censored"), with=FALSE],
by.x=c("created","created_dupe"))[order(i.id),.N-1,by=i.id];
foverlaps_table <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);
solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function
solution_3_function <- function (cases_table) {
# SOLUTION 3: Using data.table aggregation instead of dplyr::summarize
# Idea suggested by @jangorecki
# (https://stackoverflow.com/users/2490497/jangorecki)
# Count the instances where other cases were created before
# and censored after each case using vectorized sum() with data.table aggregation
cases_table_aggregated <- cases_table[order(id), sum((cases_table$created < created &
cases_table$censored > created)),by=id];
solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated, open_cases_at_creation=V1), key="id");
} # End solution_3_function
solution_4_function <- function (cases_table) {
# SOLUTION 4: Using IRanges package
# Adapted from solution suggested by @alexis_laz
# (https://stackoverflow.com/users/2414948/alexis-laz)
# The IRanges package generates ranges efficiently, intended for genome sequencing
# but working perfectly well on this data, since POSIXct values are numeric-representable
solution_4_table <<- data.table(id = cases_table$id,
open_cases_at_creation = countOverlaps(IRanges(cases_table$created,
cases_table$created),
IRanges(cases_table$created,
cases_table$censored))-1, key="id");
} # End solution_4_function
solution_5_function <- function (cases_table) {
# SOLUTION 5: Using data.table::frank()
# Adapted from solution suggested by @danas.zuokas
# (https://stackoverflow.com/users/1249481/danas-zuokas)
n <- CASE_COUNT;
# For every case compute the number of other cases
# with `created` less than `created` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]), ties.method = 'first')[1:n];
# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]), ties.method = 'first')[1:n];
solution_5_table <<- data.table(id = cases_table$id,
open_cases_at_creation = r1 - r2, key="id");
} # End solution_5_function;
# Execute user specified functions;
if (RUN_SOLUTION_1)
solution_1_timing <- system.time(solution_1_function(cases_table));
if (RUN_SOLUTION_2) {
solution_2_timing <- try(system.time(solution_2_function(cases_table)));
# created_dupe only exists if solution_2_function got past its CASE_COUNT guard
if ("created_dupe" %in% names(cases_table))
cases_table <- select(cases_table, -created_dupe);
}
if (RUN_SOLUTION_3)
solution_3_timing <- system.time(solution_3_function(cases_table));
if (RUN_SOLUTION_4)
solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
solution_5_timing <- system.time(solution_5_function(cases_table));
# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing)!="try-error") {
same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing)!="try-error") {
same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing)!="try-error") {
same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing)!="try-error") {
same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}
same_check <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
same_check2_3, same_check2_4, same_check2_5, same_check3_4,
same_check3_5, same_check4_5);
# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
"It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && class(solution_2_timing)!="try-error")
cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
cat("The IRanges solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
cat("The data.table::frank() solution took", solution_5_timing[3], "seconds.\n\n");
The data.table::foverlaps() solution is faster for fewer cases (<5,000 or so; it depends on randomness in addition to n, since it uses binary search to optimize). The dplyr::summarize() solution is faster for more cases (>5,000 or so). Much beyond 100,000, neither solution is viable, as they're both too slow.
EDIT: Added a third solution based on the idea suggested by @jangorecki that uses data.table aggregation instead of dplyr::summarize(), and is otherwise similar to the dplyr solution. For up to around 50,000 cases, it is the fastest solution. Beyond 50,000 cases, the dplyr::summarize() solution is slightly faster, but not by much. Sadly, for 1M cases it's still not practical.
EDIT 2: Added a fourth solution adapted from the solution suggested by @alexis_laz that uses the IRanges package and its countOverlaps function. It is significantly faster than the other 3 solutions. With 50,000 cases it was almost 400% faster than solutions 1 & 3.
EDIT 3: Modified the case-generating function to properly exercise the "censored" condition. Thanks to @jangorecki for catching the limitation of the previous version.
EDIT 4: Rewrote to allow the user to select which solutions to execute, and to use system.time() for timing comparison, with garbage collection before each execution for more accurate timing (per @jangorecki's astute observation). Also added some condition checks for crash cases.
EDIT 5: Added a fifth solution adapted from the solution suggested by @danas.zuokas using rank(). My experimentation suggests that it's always at least an order of magnitude slower than the other solutions. At 10,000 cases, it takes 44 seconds vs. 3.5 seconds for the dplyr::summarize solution and 0.36 seconds for the IRanges solution.
FINAL EDIT: I've made slight modifications to solution 5, which was suggested by @danas.zuokas, matching the observation by @Khashaa about types. I set the type with as.numeric in the dateTime generation function, which drastically speeds up rank, as it then operates on integers or doubles instead of dateTime objects (it increases the speed of the other functions too, but not as drastically). With some testing, setting ties.method='first' yields results consistent with the intent. data.table::frank is faster than both base::rank and IRanges::rank. bit64::rank is fastest, but it seems to handle ties differently than data.table::frank, and I can't get it to handle them as desired. Once bit64 is loaded, it masks a large number of types and functions, changing the results of data.table::frank along the way. The specific reasons why are beyond the scope of this question.
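To illustrate the ties.method = 'first' behavior discussed above (a tiny sketch on hand-made values): tied values receive distinct, increasing ranks in order of their position, which is what makes the doubled-vector trick in solution 5 well-defined.

```r
library(data.table)
x <- c(3, 1, 3, 2)
# With ties.method = "first", tied values get distinct ranks by position:
# the first 3 ranks below the second 3
frank(x, ties.method = "first")  # 3 1 4 2
```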
POST END NOTE: It turns out that data.table::frank handles POSIXct dateTimes efficiently, whereas neither base::rank nor IRanges::rank seem to. As such, even the as.numeric (or as.integer) type setting isn't necessary with data.table::frank, and there is no loss of precision from the conversion, so there are fewer ties.method discrepancies.
Thank you to everyone who contributed! I learned a lot! Much appreciated! :) Credit will be included in my source code.
ENDNOTE: This question is a refined and clarified version, with easier-to-use and more readable example code, of "More efficient method for counting open cases as of creation time of each case". I've separated it here to not overwhelm the original post with too many edits, and to simplify the creation of large numbers of dateTime pairs in the example code. That way, you don't have to work as hard to answer. Thanks again!
Recommended answer
The answer has been updated with regard to a comment by the author of the question.
I would suggest a solution using ranks. Tables are created as in the follow-up to this question, or using the dateTime-pair generation function in the present question. Both should work.
n <- cases_table[, .N]
# For every case compute the number of other cases
# with `created` less than `created` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
ties.method = 'first')[1:n]
# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
ties.method = 'first')[1:n]
Taking the difference r1 - r2 (the -1 is not required with ties.method='first') gives the result, eliminating the ranks of created among themselves. In terms of efficiency, it only requires finding ranks of vectors whose length is twice the number of rows in cases_table. data.table::frank handles POSIXct dateTime objects as quickly as numeric objects (unlike base::rank), so no type conversion is required.
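As a sanity check of the rank-difference idea (a toy sketch with hand-made values, not the benchmark data), the result of r1 - r2 can be compared against the naive pairwise count:

```r
library(data.table)
created  <- c(1, 2, 4)   # toy creation times
censored <- c(5, 3, 6)   # toy censor times
n <- length(created)
# Rank each `created` among all creation times (doubled vector)
r1 <- frank(c(created, created),  ties.method = "first")[1:n]
# Rank each `created` among creation and censor times
r2 <- frank(c(created, censored), ties.method = "first")[1:n]
r1 - r2  # 0 1 1
# Agrees with the naive pairwise count:
sapply(created, function(t) sum(created < t & censored > t))  # 0 1 1
```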