Normally I would solve this kind of problem with a loop (probably still not the best solution), but I am working with a very large dataset (7.8 million observations) and have been trying to program it more efficiently. Here is a small subset of my data:
df = data.frame(STATE = c("PA", "PA", "MD", "MD", "MO", "MO"),
                DIVISION = c("Middle_Atlantic", "Middle_Atlantic", "South_Atlantic", "South_Atlantic", "West_North_Central", "West_North_Central"),
                Middle_Atlantic_NSA = c(117.77, 119.43, 119.43, 120.72, 119.11, 117.77),
                Middle_Atlantic_SA = c(118.45, 119.65, 119.65, 120.73, 119, 118.45),
                South_Atlantic_NSA = c(134.45, 135.2, 135.2, 136.69, 134.07, 134.45),
                South_Atlantic_SA = c(134.25, 134.83, 134.83, 135.97, 133.86, 134.25),
                West_North_Central_NSA = c(152.24, 153.61, 153.61, 155.19, 151.08, 152.24),
                West_North_Central_SA = c(152.77, 153.19, 153.19, 154.44, 151.63, 152.77),
                DIV_HPI_NSA = c(117.77, 119.43, 135.2, 136.69, 151.08, 152.24),
                DIV_HPI_SA = c(118.45, 119.65, 134.83, 135.97, 151.63, 152.77))
I have already filled in the desired output for the variables DIV_HPI_NSA and DIV_HPI_SA. What I am trying to accomplish is to look up the column whose name is the value of DIVISION (e.g. "Middle_Atlantic") plus the suffix "_NSA", and return that column's value for the current row into the new variable DIV_HPI_NSA. For example, in row 3 DIVISION is "South_Atlantic", so DIV_HPI_NSA should take the value of South_Atlantic_NSA in that row (135.2). I am doing the same thing for DIV_HPI_SA with the "_SA" suffix. Currently I am trying to evaluate the string as a column name with either get() or eval(parse(text = "text_here")), but neither produces the values I want. Ideally I would like a dplyr solution, since it should be much faster than a loop. I am not sure why this does not work in dplyr, and I would like to understand why, and how to do it correctly.
Here is my current code:
comb.df = df %>%
  mutate(DIV_HPI_NSA = get(paste0(DIVISION, "_NSA")),
         DIV_HPI_SA = eval(parse(text = paste0(DIVISION, "_SA"))))
This is how I would do it with a loop. It produces the correct result, but takes an enormous amount of time:
for(i in 1:dim(comb.df)[1]){
  comb.df$DIV_HPI_NSA[i] = comb.df[i, paste0(comb.df$DIVISION[i], "_NSA")]
  comb.df$DIV_HPI_SA[i] = comb.df[i, paste0(comb.df$DIVISION[i], "_SA")]
}
My current output (e.g. DIV_HPI_NSA) keeps returning the column that corresponds to the first element of DIVISION for every row. For example, the dplyr attempt above fills DIV_HPI_NSA with the values of the Middle_Atlantic_NSA column only, because "Middle_Atlantic" is the first element of DIVISION. The eval() version has the same problem and does not produce the correct row-wise values either.
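(For illustration: the expression inside mutate() is evaluated once against the whole column, not once per row, which is why only the first DIVISION value ends up being used. A per-row variant with rowwise() does return the correct values, since get() then sees a single DIVISION value; this is only a sketch for reference and is likely far too slow at this scale:)

# Sketch for reference only: correct output, but row-by-row evaluation
# will likely be very slow on millions of rows.
df %>%
  rowwise() %>%
  mutate(DIV_HPI_NSA = get(paste0(DIVISION, "_NSA")),
         DIV_HPI_SA  = get(paste0(DIVISION, "_SA"))) %>%
  ungroup()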
Is there a better/faster approach than dplyr, and/or how can I fix the dplyr code so that it works correctly?
Let me know if you need any additional information.
Thanks in advance!
Best answer
The answer probably depends on how many distinct values DIVISION can take.
Here is a benchmark that only builds the "_NSA" output, but you can obviously do the same afterwards for "_SA".
library(dplyr)
library(rlang)            # for sym(), used in x4 below
library(microbenchmark)

#your base function in a for loop
x1 = function(db){
  for(i in 1:dim(db)[1]){
    db$DIV_HPI_NSA[i] = db[i, paste0(db$DIVISION[i], "_NSA")]
    db$DIV_HPI_SA[i]  = db[i, paste0(db$DIVISION[i], "_SA")]
  }
  db
}
#the very same logic using 'apply', which is supposed to be much faster than a base loop
#(note that apply() coerces each row to a character vector here, so the columns
# of the result come back as character/factor rather than numeric)
x2 = function(db){
  db %>% apply(1, function(x){
    x["DIV_HPI_NSA2"] = x[paste0(x["DIVISION"], "_NSA")]
    x["DIV_HPI_SA2"]  = x[paste0(x["DIVISION"], "_SA")]
    x
  }) %>% t %>% as.data.frame
}
#if DIVISION takes only a few values, you can use 'dplyr::case_when' this way
x3 = function(db){
  db %>% mutate(output2 = case_when(
    DIVISION == "Middle_Atlantic"    ~ Middle_Atlantic_NSA,
    DIVISION == "South_Atlantic"     ~ South_Atlantic_NSA,
    DIVISION == "West_North_Central" ~ West_North_Central_NSA
  ))
}
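To produce both target columns from the question, the same case_when pattern can simply be written twice, once for the "_NSA" columns and once for the "_SA" columns. A sketch of what that could look like (x3_both is my own name and is not part of the benchmark below):

x3_both = function(db){
  db %>% mutate(
    DIV_HPI_NSA = case_when(
      DIVISION == "Middle_Atlantic"    ~ Middle_Atlantic_NSA,
      DIVISION == "South_Atlantic"     ~ South_Atlantic_NSA,
      DIVISION == "West_North_Central" ~ West_North_Central_NSA),
    DIV_HPI_SA = case_when(
      DIVISION == "Middle_Atlantic"    ~ Middle_Atlantic_SA,
      DIVISION == "South_Atlantic"     ~ South_Atlantic_SA,
      DIVISION == "West_North_Central" ~ West_North_Central_SA))
}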
#but if DIVISION can take a lot of values, you may have to rlang the function a bit
x4 = function(db){
  db = db %>% mutate(output2 = -999)  #start with a dummy value
  xx = data.frame(A = db$DIVISION, B = paste0(db$DIVISION, "_NSA"), stringsAsFactors = FALSE) %>%
    unique %>%
    split(seq(nrow(.)))  #turns xx into a list of its rows
  for(i in xx){
    db = db %>% mutate(output2 = case_when(DIVISION == i$A ~ !!sym(i$B), TRUE ~ output2))
  }
  db
}
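For context, the !!sym() pattern in x4 is what allows the lookup column to be chosen from a character string: sym() (from rlang) turns the string into a symbol, and !! injects that symbol into the expression before dplyr evaluates it. A minimal standalone illustration (my own sketch; nsa_col is just an example name and is not part of the benchmark):

nsa_col = "Middle_Atlantic_NSA"            # column name held as a string
df %>%
  mutate(example = !!sym(nsa_col)) %>%     # 'example' is a copy of that column
  head(2)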
#here are some replicates of your dataset to increase the number of lines
df60 = df[rep(seq_len(nrow(df)), 10),]
df600 = df[rep(seq_len(nrow(df)), 100),]
df6k = df[rep(seq_len(nrow(df)), 1000),]
df60k = df[rep(seq_len(nrow(df)), 10000),]
df600k = df[rep(seq_len(nrow(df)), 100000),]
#the benchmark of every function with every dataset
(mbm=microbenchmark(
base = x1(df),
base60 = df60 %>% x1,
base600 = df600 %>% x1,
base6k = df6k %>% x1,
apply = x2(df),
apply60 = df60 %>% x2,
apply600 = df600 %>% x2,
apply6k = df6k %>% x2,
dplyr = x3(df),
dplyr60 = x3(df60),
dplyr600 = x3(df600),
dplyr6k = x3(df6k),
dplyr60k = x3(df60k),
dplyr600k = x3(df600k),
dplyrcw = x4(df),
dplyrcw60 = x4(df60),
dplyrcw600 = x4(df600),
dplyrcw6k = x4(df6k),
dplyrcw60k = x4(df60k),
dplyrcw600k = x4(df600k),
times=6
))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# base 515.283 599.3395 664.6767 683.396 739.3735 795.351 3 a
# base60 5125.835 5209.1620 5515.3047 5292.489 5710.0395 6127.590 3 a
# base600 53225.746 53300.1395 66678.0210 53374.533 73404.1585 93433.784 3 b
# base6k 587666.127 618005.9505 629841.8157 648345.774 650929.6600 653513.546 3 d
# apply 1220.559 1272.8895 1342.4810 1325.220 1403.4420 1481.664 3 a
# apply60 2265.710 2384.9575 2497.3980 2504.205 2613.2420 2722.279 3 a
# apply600 10852.649 11579.6225 12047.9227 12306.596 12645.5595 12984.523 3 a
# apply6k 114463.342 125155.8980 137072.6593 135848.454 148377.3180 160906.182 3 c
# dplyr 1298.964 1352.9355 1433.0417 1406.907 1500.0805 1593.254 3 a
# dplyr60 1604.559 1647.0435 1713.2313 1689.528 1767.5675 1845.607 3 a
# dplyr600 1357.676 1456.6845 1556.4223 1555.693 1655.7955 1755.898 3 a
# dplyr6k 1954.644 1970.1425 2025.0260 1985.641 2060.2170 2134.793 3 a
# dplyr60k 6366.085 6584.1590 6809.2833 6802.233 7030.8825 7259.532 3 a
# dplyr600k 46893.576 53406.6235 58086.0983 59919.671 63682.3595 67445.048 3 b
# dplyrcw 5824.182 5834.0285 5999.5897 5843.875 6087.2935 6330.712 3 a
# dplyrcw60 5591.885 5683.0535 6032.4097 5774.222 6252.6720 6731.122 3 a
# dplyrcw600 5664.820 5811.2360 5900.6413 5957.652 6018.5520 6079.452 3 a
# dplyrcw6k 6390.883 6522.7120 9003.2733 6654.541 10309.4685 13964.396 3 a
# dplyrcw60k 14379.395 14936.6140 15179.6070 15493.833 15579.7130 15665.593 3 a
# dplyrcw600k 85238.503 86607.3005 92601.6017 87976.098 96283.1510 104590.204 3 b
Conclusion

For the 6k-row dataset, apply (about 137 ms) is roughly 5 times faster than the base loop (about 630 ms), vanilla dplyr is faster still (about 2 ms), and the rlang-ed dplyr version is a little slower than vanilla (about 9 ms).

Times for base and apply appear to grow roughly linearly with the number of rows. At about 0.1 ms per row, the base loop would need on the order of 800 seconds (roughly 13 minutes) for 8M rows; apply scales similarly but is several times cheaper per row. The dplyr timings do not grow as cleanly across sizes, so it is hard to say exactly how they will behave on your full dataset, but they are clearly the fastest option here.

Source: this Q&A on dynamic variable evaluation in dplyr or a vectorized approach in R comes from Stack Overflow: https://stackoverflow.com/questions/52790411/