This article shows how to group by multiple columns in R and compute per-group counts and a mean across different columns.

Problem description

The data set contains the columns age, gender, state, income, and group. The group column is the group that each user belongs to:

     group      gender state age       income
 1       3      Female  CA     33  $75,000 - $99,999
 2       3        Male  MA     41  $50,000 - $74,999
 3       3        Male  KY     32  $35,000 - $49,999
 4       2      Female  CA     23  $35,000 - $49,999
 5       3        Male  KY     25  $50,000 - $74,999
 6       3        Male  MA     21  $75,000 - $99,999
 7       3      Female  CA     33  $75,000 - $99,999
 8       3        Male  MA     41  $50,000 - $74,999
 9       3        Male  KY     32  $35,000 - $49,999
10       2      Female  CA     23  $35,000 - $49,999
11       3        Male  KY     25  $50,000 - $74,999
12       3      Female  MA     21  $75,000 - $99,999

The above is dummy data; the goal is to get the concept right.

The goal is to group by group, gender, and income and get the counts, and for each group to compute the mean age of the users who belong to that group. The data should then be arranged in the following structure (the "expanded version"):

    group  male female CA  MA  KY  $35,000 - $49,999  $50,000 - $74,999 $75,000 - $99,999  mean_age
     2      0     2     2   0   0          2                1              0                   23
...

Here is my attempt, using dplyr:

> data %>% group_by(group,
+ gender,
+ state,
+ income) %>%
+ summarize(n()) %>%
+ mutate(mean_age = mean(age))

I am also exploring the spread function.

Recommended answer

In addition to @treysp's answer, you could use unite and spread to create a wide (and unwieldy) table. (I'm using as.data.frame() only to force printing of all columns.)

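library(tidyverse)
# Count rows and average age for each group/gender/state/income combination,
# merge the three labels into a single key, then spread the counts into columns.
df %>%
    group_by(group, gender, state, income) %>%
    summarize(n = n(), mean_age = mean(age)) %>%
    unite(key, gender, state, income) %>%
    spread(key, n) %>%
    as.data.frame()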
#  group mean_age Female_CA_$35,000 - $49,999 Female_CA_$75,000 - $99,999
#1     2       23                           2                          NA
#2     3       21                          NA                          NA
#3     3       25                          NA                          NA
#4     3       32                          NA                          NA
#5     3       33                          NA                           2
#6     3       41                          NA                          NA
#  Female_MA_$75,000 - $99,999 Male_KY_$35,000 - $49,999
#1                          NA                        NA
#2                           1                        NA
#3                          NA                        NA
#4                          NA                         2
#5                          NA                        NA
#6                          NA                        NA
#  Male_KY_$50,000 - $74,999 Male_MA_$50,000 - $74,999 Male_MA_$75,000 - $99,999
#1                        NA                        NA                        NA
#2                        NA                        NA                         1
#3                         2                        NA                        NA
#4                        NA                        NA                        NA
#5                        NA                        NA                        NA
#6                        NA                         2                        NA
#
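
The table above has one row per combination rather than one row per group, so it is not quite the "expanded version" from the question. As a rough alternative, the sketch below uses base R only (table, cbind, tapply) instead of the unite/spread approach; it is not part of the original answer. The names brackets, counts, and expanded are made up for illustration, and the data frame is rebuilt inline from the dummy data so the block runs on its own.

```r
# Dummy data from the question, rebuilt inline (assumption: same 12 rows).
brackets <- c("$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999")
df <- data.frame(
  group  = c(3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  gender = c("Female", "Male", "Male", "Female", "Male", "Male",
             "Female", "Male", "Male", "Female", "Male", "Female"),
  state  = c("CA", "MA", "KY", "CA", "KY", "MA",
             "CA", "MA", "KY", "CA", "KY", "MA"),
  age    = c(33, 41, 32, 23, 25, 21, 33, 41, 32, 23, 25, 21),
  income = brackets[c(3, 2, 1, 1, 2, 3, 3, 2, 1, 1, 2, 3)],
  stringsAsFactors = FALSE
)

# One contingency table per variable: rows are groups, columns are levels.
# cbind() glues them side by side into a single count matrix.
counts <- cbind(
  table(df$group, df$gender),
  table(df$group, df$state),
  table(df$group, df$income)
)

# Mean age per group, computed from the raw rows (not from the counts).
mean_age <- tapply(df$age, df$group, mean)

# One row per group: counts for every level plus the group's mean age.
# check.names = FALSE keeps column names such as "$35,000 - $49,999" intact.
expanded <- data.frame(
  group = as.integer(rownames(counts)),
  counts,
  mean_age = as.vector(mean_age[rownames(counts)]),
  check.names = FALSE,
  row.names = NULL
)

print(expanded)
```

With the dummy data, the row for group 2 should show 2 under Female, CA, and $35,000 - $49,999, zeros elsewhere, and a mean age of 23. Levels are ordered alphabetically by table(), so the column order differs slightly from the layout sketched in the question.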


Sample data

df <- read.table(text =
    "group      gender state age       income
 1       3      Female  CA     33  '$75,000 - $99,999'
 2       3        Male  MA     41  '$50,000 - $74,999'
 3       3        Male  KY     32  '$35,000 - $49,999'
 4       2      Female  CA     23  '$35,000 - $49,999'
 5       3        Male  KY     25  '$50,000 - $74,999'
 6       3        Male  MA     21  '$75,000 - $99,999'
 7       3      Female  CA     33  '$75,000 - $99,999'
 8       3        Male  MA     41  '$50,000 - $74,999'
 9       3        Male  KY     32  '$35,000 - $49,999'
10       2      Female  CA     23  '$35,000 - $49,999'
11       3        Male  KY     25  '$50,000 - $74,999'
12       3      Female  MA     21  '$75,000 - $99,999'", header = T, row.names = 1)


08-06 05:19