Consider this dataset:

> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+                    country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
+                    action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
+                    signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+                    ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
Agreement_number country       action signature_date ratification_date
              1  Canada    signature           2000                NA
             1  Canada ratification             NA              2001
             1     USA    signature           2000                NA
             1     USA ratification             NA              2002
             2  Canada    signature           2001                NA
             2  Canada ratification             NA              2001
             2     USA    signature           2002                NA
             2     USA ratification             NA              2002

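As you can see, half of the rows contain duplicated information. For a small dataset like this one, removing the duplicates is really easy. I could use the coalesce function (dplyr package), drop the "action" column, and then delete all the irrelevant rows. There are many other ways, though. The final result should look like this: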
> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+                     country = c("Canada", "USA", "Canada","USA"),
+                     signature_date = c(2000,2000,2001,2002),
+                     ratification_date = c(2001, 2002, 2001, 2002))
> DATA
Agreement_number country signature_date ratification_date
             1  Canada           2000              2001
             1     USA           2000              2002
             2  Canada           2001              2001
             2     USA           2002              2002
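
For reference, the coalesce route described above might look roughly like the sketch below. It assumes that, within each agreement and country, the signature row comes directly before the ratification row (as in the toy data). The exact sequence of dplyr verbs is an illustration, not the code that was actually run on the real data.

```r
library(dplyr)

# toy data from the question
DATA <- data.frame(
  Agreement_number = c(1, 1, 1, 1, 2, 2, 2, 2),
  country = rep(c("Canada", "Canada", "USA", "USA"), 2),
  action = rep(c("signature", "ratification"), 4),
  signature_date = c(2000, NA, 2000, NA, 2001, NA, 2002, NA),
  ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002)
)

result <- DATA %>%
  group_by(Agreement_number, country) %>%
  # fill the missing ratification date on the signature row
  # with the value from the next row of the same group
  mutate(ratification_date = coalesce(ratification_date,
                                      lead(ratification_date))) %>%
  ungroup() %>%
  # keep one row per agreement and country, drop the helper column
  filter(action == "signature") %>%
  select(-action)

result
```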

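The problem is that my real dataset is much larger (102000 x 270) and has many more variables. The real data is also more irregular, with more missing values. The coalesce function seems slow. The best loop I have managed so far still takes 5-10 minutes to run.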

Is there a simple way to do this faster? I have a feeling that there must be some function in R for this kind of operation, but I could not find one.

Best answer

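The OP has stated that his production data has 100 k rows x 270 columns and that speed is a concern for him. Therefore, I suggest using data.table.
I am aware that Harland has also suggested using data.table and dcast(), but the solution below is a different approach. It brings the rows into the proper order and copies ratification_date to the signature row. After some clean-up we get the desired result.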

library(data.table)

# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]

# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)

# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"),
                by = c("Agreement_number", "country")][
                  # keep only signature rows, remove action column
                  action == "signature"][, action := NULL]
result
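
To make the key step easier to follow, here is a minimal sketch of what shift(type = "lead") does when combined with by. The table dt and its columns grp and val are made up for illustration only.

```r
library(data.table)

# toy data: in each group the value sits on the second row only
dt <- data.table(grp = c("a", "a", "b", "b"),
                 val = c(NA, 10, NA, 20))

# shift(type = "lead") pulls the value from the next row;
# because of `by`, the last row of a group gets NA rather than
# the first value of the following group
dt[, val_lead := shift(val, type = "lead"), by = grp]
dt
```

The grouping is what keeps a ratification date from leaking into the signature row of a different agreement or country.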

Data
The OP has mentioned that his production data has 270 columns. To simulate this, I have added two dummy columns:
set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
                   country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
                   action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
                   signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
                   ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
                   dummy1 = rep(sample(4), each = 2L),
                   dummy2 = rep(sample(LETTERS[1:4]), each = 2L))
Note that set.seed() is used to get reproducible results when sampling.

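Addendum: dcast() with additional columns
Harland has suggested using data.table and dcast(). Besides several other flaws in his answer, it does not handle the additional columns the OP has mentioned.
The dcast() approach below will return the additional columns as well: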
library(data.table)

# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]

# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]

# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action,
                value.var = "signature_date")
result

Note that this approach will change the order of the columns.
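
If the original column names and order matter, they can be restored after the cast. The sketch below is one possible way, using setnames() and setcolorder() from data.table. It rebuilds a small copy of the data with a single dummy column (fixed values instead of sample()) so that it runs on its own. The helper variable orig_cols is introduced here for illustration.

```r
library(data.table)

DATA <- data.table(
  Agreement_number = c(1, 1, 1, 1, 2, 2, 2, 2),
  country = rep(c("Canada", "Canada", "USA", "USA"), 2),
  action = rep(c("signature", "ratification"), 4),
  signature_date = c(2000, NA, 2000, NA, 2001, NA, 2002, NA),
  ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
  dummy1 = rep(c(3, 1, 4, 2), each = 2L)
)

# remember the desired column order (everything except action)
orig_cols <- setdiff(names(DATA), "action")

# same steps as in the dcast() approach above
DATA[, action := ordered(action, levels = c("signature", "ratification"))]
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]
result <- dcast(DATA, Agreement_number + country + ... ~ action,
                value.var = "signature_date")

# rename the cast columns and put all columns back in the original order
setnames(result, c("signature", "ratification"),
         c("signature_date", "ratification_date"))
setcolorder(result, orig_cols)
result
```

Both functions modify result by reference, so no copy of the wide table is made.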

Regarding "r - How to avoid slow loops on large datasets?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45427161/
