本文介绍了dplyr用于行分位数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个df阶层,每个阶层都有1000个样本,该样本来自该阶层的估计值的后验分布。

  mydf<-as.data.frame(lapply(seq(1,1000),rnorm,n = 100)) 
colnames(mydf)<-paste('s',seq(1,ncol(mydf)),sep ='')

我想为每行分配几分位数的列。在经典R中,我会这样写。

  quant<-t(apply(mydf,1,分位数,概率= c(.025,.5,.975)))
colnames(quants)<-c('s_lo','s_med','s_hi')
mydf<-cbind(mydf,我怀疑在 dplyr中有直接的方法 b b b code>(也许是 rowwise ?),但是我的尝试失败了。有想法吗?

解决方案

dplyr 并未针对此类基于行的计算进行优化。尽管您可以使用 rowwise()做到这一点,但我还是建议不要这样做:性能会很糟糕。最好的速度可能是期望矩阵并且可以在行上进行操作的东西。我建议应用



而不是处理100x1000 data.frame ,为简便起见,我将使用5列:

  set.seed(2)
mydf <-as.data.frame(lapply(seq(1,5),rnorm,n = 10))
colnames(mydf)<-paste('s',seq(1,ncol(mydf)) ),sep ='')

转换为矩阵时,$ c>才是合理的。在这种情况下,它们都是数字,所以我们很安全。 (如果数据框中有非数字列,请仅在此处提取所需的列,然后将它们绑定回去。)

  mymtx<-as.matrix(mydf)
apply(mymtx,1,分位数,c(0.1,0.9))
#[,1] [,2] [,3] [,4 ] [,5] [,6] [,7] [,8] [,9] [,10]
#10%1.028912 1.430939 1.999521 0.305907 1.753824 0.03267599 1.934381 1.270504 2.995816 1.489634
#90%4.950067 3.807735 4.881554 6.123989 4.886388 5.55628806 4.207605 4.184460 4.406384 3.782134

使用 apply <这样的结果是基于行的形式,也许是从人们期望的结果开始转置的。只需将其包装在 t(...)中,您就会看到您可能期望的列。



此可以使用 cbind 或类似的函数与原始数据帧重新组合。



这可以在这样的管道中完成:

  mydf%&%;%
bind_cols(as.data.frame(t(apply(。,1,分位数,c(0.1,0.9)))))
#s1 s2 s3 s4 s5 10%90%
#1 0.1030855 2.4176508 5.0908192 4.738939 4.616414 1.02891157 4.950067
#2 1.1848492 2.9817528 1.8000742 4.318960 3.040897 1.43093918 3.807735
#3 2.5878453 1.6073046 4.5896382 5.076164 4.158295 1.99952092 4.881554
#4 -0.1303757 0.9603310 4.9546516 3.715842 6.903547 0.30590700 6.123989
#5 0.9197482 3.7822290 3.0049378 3.223325 5.622494 1.75382406 4.886388 $ 0.3110691 0.5482936 3.404340 6.990920 0.03267599 5.556288
#7 1.7079547 2.8786046 3.4772373 2.274020 4.694516 1.93438093 4.2 07605
#8 0.7603020 2.0358067 2.4034418 3.097416 4.909156 1.27050387 4.184460
#9 2.9844739 3.0128287 3.7922033 3.440938 4.815839 2.99581584 4.406384
#10 0.8612130 2.4322652 3.2896367 3.753487 3.801232 1.48963385 3.782134

我将该列留给您命名。


I have a df of strata, each of which has 1000 samples from a posterior distribution of the estimates from that stratum.

mydf <- as.data.frame(lapply(seq(1, 1000), rnorm, n=100))
colnames(mydf) <- paste('s', seq(1, ncol(mydf)), sep='')

I want to add columns for a few quantiles of the distribution for each row. In classic R, I'd write this.

quants <- t(apply(mydf, 1, quantile, probs=c(.025, .5, .975)))
colnames(quants) <- c('s_lo', 's_med', 's_hi')
mydf <- cbind(mydf, quants)

I suspect there's a direct way to do this in dplyr (maybe rowwise?) but my attempts have failed. Ideas?

解决方案

dplyr is not optimized for row-based calculations like that. Though you can do this with rowwise(), I recommend against it: performance will be abysmal. Your best speed will likely be with something that expects a matrix, and can operate on the rows. I suggest apply.

Instead of dealing with a 100x1000 data.frame, for brevity I'll go with 5 columns:

set.seed(2)
mydf <- as.data.frame(lapply(seq(1, 5), rnorm, n=10))
colnames(mydf) <- paste('s', seq(1, ncol(mydf)), sep='')

Converting to a matrix is only reasonable if all columns are of the same class. In this case, they are all numeric so we are safe. (If you have non-numeric columns in the dataframe, extract only the ones you need here and bind them back in later.)

mymtx <- as.matrix(mydf)
apply(mymtx, 1, quantile, c(0.1, 0.9))
#         [,1]     [,2]     [,3]     [,4]     [,5]       [,6]     [,7]     [,8]     [,9]    [,10]
# 10% 1.028912 1.430939 1.999521 0.305907 1.753824 0.03267599 1.934381 1.270504 2.995816 1.489634
# 90% 4.950067 3.807735 4.881554 6.123989 4.886388 5.55628806 4.207605 4.184460 4.406384 3.782134

One notable with using apply like this is that the result is in row-based form, perhaps transposed from what one would expect. Simply wrap it in t(...) and you'll see the columns you might expect.

This can be recombined with the original dataframe using cbind or similar function.

This can be done in a pipeline like so:

mydf %>%
  bind_cols(as.data.frame(t(apply(., 1, quantile, c(0.1, 0.9)))))
#            s1         s2        s3       s4       s5        10%      90%
# 1   0.1030855  2.4176508 5.0908192 4.738939 4.616414 1.02891157 4.950067
# 2   1.1848492  2.9817528 1.8000742 4.318960 3.040897 1.43093918 3.807735
# 3   2.5878453  1.6073046 4.5896382 5.076164 4.158295 1.99952092 4.881554
# 4  -0.1303757  0.9603310 4.9546516 3.715842 6.903547 0.30590700 6.123989
# 5   0.9197482  3.7822290 3.0049378 3.223325 5.622494 1.75382406 4.886388
# 6   1.1324203 -0.3110691 0.5482936 3.404340 6.990920 0.03267599 5.556288
# 7   1.7079547  2.8786046 3.4772373 2.274020 4.694516 1.93438093 4.207605
# 8   0.7603020  2.0358067 2.4034418 3.097416 4.909156 1.27050387 4.184460
# 9   2.9844739  3.0128287 3.7922033 3.440938 4.815839 2.99581584 4.406384
# 10  0.8612130  2.4322652 3.2896367 3.753487 3.801232 1.48963385 3.782134

I'll leave the column naming up to you.

这篇关于dplyr用于行分位数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-16 09:35