This article covers how to use multiple sets of quantiles, taken from a large data frame, within a grouped data frame in R.

Problem description


I have the following problem: I have a large data frame from which I need to extract the quantiles of a variable by group. For instance:

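library(dplyr)

# Collect the mpg quartiles for each gear value. The list is indexed by the
# gear number itself, so positions 1 and 2 stay NULL.
list_q <- list()

for (i in 3:5) {
  tmp <- mtcars %>%
    filter(gear == i) %>%
    pull(mpg) %>%
    quantile(probs = seq(0, 1, 0.25), na.rm = TRUE)

  list_q[[i]] <- tmp
}

list_q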

With this output:

[[3]]
  0%  25%  50%  75% 100% 
10.4 14.5 15.5 18.4 21.5 

[[4]]
    0%    25%    50%    75%   100% 
17.800 21.000 22.800 28.075 33.900 

[[5]]
  0%  25%  50%  75% 100% 
15.0 15.8 19.7 26.0 30.4 
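
As a loop-free variant of the step above, the per-gear quantiles can also be collected with base R's split() and lapply(). This is a minimal sketch, not part of the original post; the names mpg_by_gear and q_by_gear are made up for illustration. It returns a list named by gear value ("3", "4", "5"), which avoids the empty positions 1 and 2 that indexing a list by the gear number leaves behind.

```r
# Split mpg into one vector per gear value (mtcars ships with base R).
mpg_by_gear <- split(mtcars$mpg, mtcars$gear)

# Compute the same quartile breaks for each piece.
q_by_gear <- lapply(mpg_by_gear, quantile,
                    probs = seq(0, 1, 0.25), na.rm = TRUE)

# Elements are looked up by name rather than by position.
q_by_gear[["3"]]
```

Here q_by_gear[["3"]] should hold the same breaks as list_q[[3]] in the loop version.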

Now I need to compute the mean of the variable for each group and determine which quantile bin each mean falls into, using the quantiles of the original measurements:

a <- mtcars %>% 
  group_by(gear, carb) %>% 
  summarize(mpg_mean = mean(mpg)) %>% 
  ungroup()

    gear  carb mpg_mean
   <dbl> <dbl>    <dbl>
 1     3     1     20.3
 2     3     2     17.2
 3     3     3     16.3
 4     3     4     12.6
 5     4     1     29.1
 6     4     2     24.8
 7     4     4     19.8
 8     5     2     28.2
 9     5     4     15.8
10     5     6     19.7
11     5     8     15 

So I could do this:


g3 <- a %>% 
  filter(gear == 3) %>% 
  mutate(quantile = cut(mpg_mean, list_q[[3]], labels = FALSE, include.lowest = TRUE))

g4 <- a %>% 
  filter(gear == 4) %>% 
  mutate(quantile = cut(mpg_mean, list_q[[4]], labels = FALSE, include.lowest = TRUE))

g5 <- a %>% 
  filter(gear == 5) %>% 
  mutate(quantile = cut(mpg_mean, list_q[[5]], labels = FALSE, include.lowest = TRUE))

bind_rows(g3, g4, g5)

Obtaining:

# A tibble: 11 x 4
    gear  carb mpg_mean quantile
   <dbl> <dbl>    <dbl>    <int>
 1     3     1     20.3        4
 2     3     2     17.2        3
 3     3     3     16.3        3
 4     3     4     12.6        1
 5     4     1     29.1        4
 6     4     2     24.8        3
 7     4     4     19.8        1
 8     5     2     28.2        4
 9     5     4     15.8        1
10     5     6     19.7        2
11     5     8     15          1
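
To make the binning step concrete, here is a small base-R sketch of what cut() does with labels = FALSE and include.lowest = TRUE. The break values and the rounded means are copied from the gear == 3 output above; the variable names are made up for illustration.

```r
# Quartile breaks for gear == 3, copied from the printed list_q[[3]].
breaks_gear3 <- c(10.4, 14.5, 15.5, 18.4, 21.5)

# Rounded group means for gear == 3 (carb 1, 2, 3, 4) from the table above.
means_gear3 <- c(20.3, 17.2, 16.3, 12.6)

# labels = FALSE returns the integer index of the interval each value falls in.
bins <- cut(means_gear3, breaks = breaks_gear3,
            labels = FALSE, include.lowest = TRUE)
print(bins)  # 4 3 3 1

# Intervals are (a, b] by default, so the minimum itself only lands in bin 1
# when include.lowest = TRUE; without it the result is NA.
lowest_with <- cut(10.4, breaks = breaks_gear3, labels = FALSE, include.lowest = TRUE)
lowest_without <- cut(10.4, breaks = breaks_gear3, labels = FALSE)

# Values outside the range of the breaks are returned as NA.
outside <- cut(25, breaks = breaks_gear3, labels = FALSE, include.lowest = TRUE)
```

Since each group mean lies between the minimum and maximum mpg of its own gear, none of the means here can fall outside that gear's breaks.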

Is there a more efficient way to do this?

Solution

We can first group_by gear and store the quantiles of mpg in a list column. We can then additionally group_by carb to get the mean of mpg, and use the quantiles stored in the list column to cut that mean.

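library(dplyr)

mtcars %>% 
  group_by(gear) %>% 
  # Store each gear's mpg quantiles in a list column
  mutate(gear_q = list(quantile(mpg))) %>%
  # dplyr >= 1.0.0 uses .add; older versions use add = TRUE
  group_by(carb, .add = TRUE) %>%
  summarize(mpg_mean = mean(mpg), 
            gear_q = list(first(gear_q))) %>%
  # After summarize() the data is still grouped by gear
  mutate(quantile = cut(mpg_mean, first(gear_q), 
                        labels = FALSE, include.lowest = TRUE)) %>%
  select(-gear_q)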

#    gear  carb mpg_mean quantile
#   <dbl> <dbl>    <dbl>    <int>
# 1     3     1     20.3        4
# 2     3     2     17.2        3
# 3     3     3     16.3        3
# 4     3     4     12.6        1
# 5     4     1     29.1        4
# 6     4     2     24.8        3
# 7     4     4     19.8        1
# 8     5     2     28.2        4
# 9     5     4     15.8        1
#10     5     6     19.7        2
#11     5     8     15          1
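
For a large data frame, one possible variant is to compute the breaks once per gear in a small lookup table and join it onto the group means, so the list column is not repeated on every original row. This is a sketch under assumptions, not the answer's method: it assumes dplyr >= 1.0.0 (for the .groups argument of summarize()), and the names probs, gear_breaks and result are made up for illustration.

```r
library(dplyr)

probs <- seq(0, 1, 0.25)

# One row per gear, with that gear's mpg quartile breaks in a list column.
gear_breaks <- mtcars %>%
  group_by(gear) %>%
  summarize(gear_q = list(quantile(mpg, probs = probs, na.rm = TRUE)))

result <- mtcars %>%
  group_by(gear, carb) %>%
  summarize(mpg_mean = mean(mpg), .groups = "drop") %>%
  # Attach each gear's breaks to its rows.
  left_join(gear_breaks, by = "gear") %>%
  group_by(gear) %>%
  # Within a gear every row carries the same breaks, so take the first.
  mutate(quantile = cut(mpg_mean, gear_q[[1]],
                        labels = FALSE, include.lowest = TRUE)) %>%
  ungroup() %>%
  select(-gear_q)

result
```

On mtcars this should give the same quantile column as the output above. Whether it is actually faster on a given data set is worth measuring rather than assuming.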


09-21 06:26