r - 在大数据集中提交每个案件时计算未结案件的有效方法

在大数据集(约100万个案例)中，每个案例都有一个“已创建”和“已审查”的dateTime。我想计算在创建每个案例时打开的其他案例的数量。在“创建”和“审查”的dataTimes之间打开案例。

几种解决方案在小型数据集(小于100,000个案例)上效果很好，但是计算时间却呈指数增长。我的估计是计算时间随着函数3n ^ 2的增加而增加。在n = 100,000情况下，在具有6 * 4GHz内核和64GB RAM的服务器上，计算时间大于20分钟。即使使用多核库，充其量也只能将时间减少8或10倍。不足以处理大约100万个案例。

我正在寻找一种更有效的方法来进行此计算。下面，我提供了一个函数，该函数使您可以使用dateTime和dplyr库轻松创建大量“已创建”和“已审查”的data.table对，以及迄今为止尝试的两种解决方案。为了简单起见，将定时报告给用户。您只需更改顶部的“CASE_COUNT”变量即可重新执行并再次查看时间，并轻松比较您可能需要建议的其他解决方案的时间。

我将使用其他解决方案来更新原始帖子，以适当感谢他们的作者。在此先感谢您的帮助!

# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);

# Set seed for reproducibility
set.seed(123)

# Set number of cases & date range variables
CASE_COUNT  <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00",
                          format="%Y-%m-%d %H:%M:%S",
                          tz="UTC", origin="1970-01-01");
RANGE_END   <- as.POSIXct("2012-01-01 00:00:00",
                          format="%Y-%m-%d %H:%M:%S",
                          tz="UTC", origin="1970-01-01");

# Select which solutions you want to run in this test
RUN_SOLUTION_1 <- TRUE;     # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE;     # data.table:foverlaps()
RUN_SOLUTION_3 <- TRUE;     # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE;     # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE;     # data.table::frank()

# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)

generate_cases_table <- function(n = CASE_COUNT, start_val=RANGE_START, end_val=RANGE_END) {
    # Measure duration between start_val & end_val
    duration <- as.numeric(difftime(end_val, start_val, unit="secs"));

    # Select random values in duration to create start_offset
    start_offset   <- runif(n, 0, duration);

    # Calculate the creation time list
    created_list  <- start_offset + start_val;

    # Calculate acceptable time range for censored values
    # since they must always be after their respective creation value
    censored_range <- as.numeric(difftime(RANGE_END, created_list, unit="secs"));

    # Select random values in duration to create end_offset
    creation_to_censored_times <- runif(n, 0, censored_range);

    censored_list <- created_list + creation_to_censored_times;

    # Create and return a data.table with creation & censor values
    # calculated from start or end with random offsets
    return_table  <- data.table(id       = 1:n,
                                created  = created_list,
                                censored = censored_list);

    return(return_table);
}

# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();

solution_1_function <- function (cases_table) {
    # SOLUTION 1: Using dplyr::summarize:

    # Group by id to set parameters for summarize() function
    cases_table_grouped <- group_by(cases_table, id);

    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() within summarize()

    cases_table_summary <- summarize(cases_table_grouped,
                           open_cases_at_creation = sum((cases_table$created  < created &
                                                         cases_table$censored > created)));
    solution_1_table <<- as.data.table(cases_table_summary, key="id");
} # End solution_1_function

solution_2_function <- function (cases_table) {
    # SOLUTION 2: Using data.table::foverlaps:

    # Adapted from solution provided by @Davidarenburg
    # (https://stackoverflow.com/users/3001626/david-arenburg)

    # The foverlaps() solution tends to crash R with large case counts
    # I suspect it has to do with memory assignment of the very large objects
    # It maxes RAM on my system (64GB) before crashing, possibly attempting
    # to write beyond its assigned memory limits.
    # I'll submit a reproduceable bug to the data.table team since
    # foverlaps() is pretty new and known to be occasionally unstable

    if (CASE_COUNT > 50000) {
        stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
    }

    setDT(cases_table)[, created_dupe := created];
    setkey(cases_table, created, censored);

    foverlaps_table  <- foverlaps(cases_table[,c("id","created","created_dupe"), with=FALSE],
                                  cases_table[,c("id","created","censored"),    with=FALSE],
                                  by.x=c("created","created_dupe"))[order(i.id),.N-1,by=i.id];

    foverlaps_table  <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);

    solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function

solution_3_function <- function (cases_table) {
    # SOLUTION 3: Using data.table aggregation instead of dplyr::summarize

    # Idea suggested by @jangorecki
    # (https://stackoverflow.com/users/2490497/jangorecki)

    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() with data.table aggregation

    cases_table_aggregated <- cases_table[order(id), sum((cases_table$created  < created &
                                                     cases_table$censored > created)),by=id];

    solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated, open_cases_at_creation=V1), key="id");

} # End solution_3_function

solution_4_function <- function (cases_table) {
    # SOLUTION 4: Using IRanges package

    # Adapted from solution suggested by @alexis_laz
    # (https://stackoverflow.com/users/2414948/alexis-laz)

    # The IRanges package generates ranges efficiently, intended for genome sequencing
    # but working perfectly well on this data, since POSIXct values are numeric-representable
    solution_4_table <<- data.table(id      = cases_table$id,
                     open_cases_at_creation = countOverlaps(IRanges(cases_table$created,
                                                                    cases_table$created),
                                                            IRanges(cases_table$created,
                                                                    cases_table$censored))-1, key="id");

} # End solution_4_function

solution_5_function <- function (cases_table) {
    # SOLUTION 5: Using data.table::frank()

    # Adapted from solution suggested by @danas.zuokas
    # (https://stackoverflow.com/users/1249481/danas-zuokas)

    n <- CASE_COUNT;

    # For every case compute the number of other cases
    # with `created` less than `created` of other cases
    r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]), ties.method = 'first')[1:n];

    # For every case compute the number of other cases
    # with `censored` less than `created`
    r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]), ties.method = 'first')[1:n];

    solution_5_table <<- data.table(id      = cases_table$id,
                     open_cases_at_creation = r1 - r2, key="id");

} # End solution_5_function;

# Execute user specified functions;
if (RUN_SOLUTION_1)
    solution_1_timing <- system.time(solution_1_function(cases_table));
if (RUN_SOLUTION_2) {
    solution_2_timing <- try(system.time(solution_2_function(cases_table)));
    cases_table <- select(cases_table, -created_dupe);
}
if (RUN_SOLUTION_3)
    solution_3_timing <- system.time(solution_3_function(cases_table));
if (RUN_SOLUTION_4)
    solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
    solution_5_timing <- system.time(solution_5_function(cases_table));

# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing)!="try-error") {
    same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
    same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
    same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
    same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing)!="try-error") {
    same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing)!="try-error") {
    same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing)!="try-error") {
    same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
    same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
    same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
    same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}


same_check    <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
                     same_check2_3, same_check2_4, same_check2_5, same_check3_4,
                     same_check3_5, same_check4_5);

# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
    "It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
    cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && class(solution_2_timing)!="try-error")
    cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
    cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
    cat("The IRanges solution solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
    cat("The data.table:frank() solution solution took", solution_5_timing[3], "seconds.\n\n");

data.table::foverlaps()解决方案在较少的情况下(dplyr::summarize()解决方案在更多情况下(> 5,000左右)更快。远远超过100,000，这两种解决方案都不可行，因为它们都太慢了。

编辑:添加了基于@jangorecki提出的想法的第三种解决方案，该解决方案使用data.table聚合而不是dplyr::summarize()，否则类似于dplyr解决方案。对于大约50,000个案例，这是最快的解决方案。超过50,000个案例，dplyr::summarize()解决方案的速度稍快一些，但幅度不大。可悲的是，对于一百万个案例，这仍然不切实际。

EDIT2:添加了第四个解决方案，该解决方案改编自@alexis_laz建议的解决方案，该解决方案使用IRanges包及其countOverlaps函数。
它比其他3个解决方案快得多。 50,000个案例比解决方案1和3快了近400％。

编辑3:修改了案例生成功能，以适本地行使“审查”条件。感谢@jangorecki catch 了以前版本的限制。

EDIT4:重写以允许用户选择要执行的解决方案，并在每次执行之前使用system.time()与垃圾回收进行时序比较以获得更准确的时序(根据@jangorecki的敏锐观察)-还添加了一些崩溃情况检查条件。

EDIT5:添加了第五个解决方案，改编自@ danas.zuokas使用rank()建议的解决方案。我的实验表明，它总是比其他解决方案至少慢一个数量级。在10,000种情况下，所需的时间为44秒，而dplyr::summarize需要3.5秒，而IRanges解决方案需要0.36秒。

最终编辑:我对@ danas.zuokas建议的解决方案5进行了一些修改，并使@Khashaa对类型的观察与之匹配。我已经在as.numeric生成函数中设置了dataTime类型，由于它在rank或integers而不是doubles对象上运行，因此可以大大加快dateTime的速度(也可以提高其他功能的速度，但不那么快)。通过一些测试，设置ties.method='first'会产生与意图一致的结果。 data.table::frank比base::rank和IRanges::rank都快。 bit64::rank是最快的，但是它似乎与data.table::frank的处理方式不同，我无法按需处理它。加载bit64后，它将屏蔽大量类型和函数，并在此过程中更改data.table::frank的结果。具体原因超出了此问题的范围。

POST END NOTE:事实证明data.table::frank有效地处理了POSIXct dateTimes，而base::rank和IRanges::rank似乎都没有。这样，即使as.numeric甚至不需要as.integer(或data.table::frank)类型设置，转换也不会损失精度，因此ties.method差异更少。
感谢所有贡献者!我学到了很多!非常感激! :)
信用将包含在我的源代码中。

ENDNOTE:这个问题是More efficient method for counting open cases as of creation time of each case的精炼和澄清版本，具有更易于使用和易读的示例代码-我在这里将其分开是为了不使原始文章过多编辑而使您不知所措，并简化了大量dataTime的创建对在示例代码中。这样，您就不必费劲地回答。再次感谢!

最佳答案

答案将根据问题作者对评论的意见进行更新。

我建议使用等级的解决方案。如a follow up to this question中那样创建表，或者在本问题中使用dateTime对生成函数来创建表。两者都应该起作用。

n <- cases_table[, .N]

# For every case compute the number of other cases
# with `created` less than `creation` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
           ties.method = 'first')[1:n]

# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
           ties.method = 'first')[1:n]

取差异r1 - r2(ties.method ='first'不需要-1)得出结果(消除created的等级)。就效率而言，只需要在cases_table中找到具有该行数的长度的向量的秩即可。 data.table::frank处理POSIXct dateTime对象的速度与numeric对象一样快(与base::rank不同)，因此不需要类型转换。

关于r - 在大数据集中提交每个案件时计算未结案件的有效方法，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/34245295/