Problem description
The data set contains age, gender, state, income, and group. Group represents the group that each user belongs to:
group gender state age income
1 3 Female CA 33 $75,000 - $99,999
2 3 Male MA 41 $50,000 - $74,999
3 3 Male KY 32 $35,000 - $49,999
4 2 Female CA 23 $35,000 - $49,999
5 3 Male KY 25 $50,000 - $74,999
6 3 Male MA 21 $75,000 - $99,999
7 3 Female CA 33 $75,000 - $99,999
8 3 Male MA 41 $50,000 - $74,999
9 3 Male KY 32 $35,000 - $49,999
10 2 Female CA 23 $35,000 - $49,999
11 3 Male KY 25 $50,000 - $74,999
12 3 Female MA 21 $75,000 - $99,999
The above is dummy data; the goal is to get the concept right.
The goal is to group by group, gender, and income, get the count for each, and for each group compute the mean age of the users who belong to it. The data should then be arranged in the following structure ("Expanded Version"):
group male female CA MA KY $35,000 - $49,999 $50,000 - $74,999 $75,000 - $99,999 mean_age
2 0 2 2 0 0 2 1 0 23
...
Here is my attempt, using dplyr:
> data %>% group_by(group,
+ gender,
+ state,
+ income) %>%
+ summarize(n()) %>%
+ mutate(mean_age = mean(age))
I am also exploring the spread function.
Recommended answer
In addition to @treysp's answer, you could use unite and spread to create a wide (and unwieldy) table. (I'm using as.data.frame() only to force printing of all columns.)
library(tidyverse)
df %>%
    group_by(group, gender, state, income) %>%
    summarize(n = n(), mean_age = mean(age)) %>%
    unite(key, gender, state, income) %>%
    spread(key, n) %>%
    as.data.frame()
# group mean_age Female_CA_$35,000 - $49,999 Female_CA_$75,000 - $99,999
#1 2 23 2 NA
#2 3 21 NA NA
#3 3 25 NA NA
#4 3 32 NA NA
#5 3 33 NA 2
#6 3 41 NA NA
# Female_MA_$75,000 - $99,999 Male_KY_$35,000 - $49,999
#1 NA NA
#2 1 NA
#3 NA NA
#4 NA 2
#5 NA NA
#6 NA NA
# Male_KY_$50,000 - $74,999 Male_MA_$50,000 - $74,999 Male_MA_$75,000 - $99,999
#1 NA NA NA
#2 NA NA 1
#3 2 NA NA
#4 NA NA NA
#5 NA NA NA
#6 NA 2 NA
#
Sample data
df <- read.table(text =
"group gender state age income
1 3 Female CA 33 '$75,000 - $99,999'
2 3 Male MA 41 '$50,000 - $74,999'
3 3 Male KY 32 '$35,000 - $49,999'
4 2 Female CA 23 '$35,000 - $49,999'
5 3 Male KY 25 '$50,000 - $74,999'
6 3 Male MA 21 '$75,000 - $99,999'
7 3 Female CA 33 '$75,000 - $99,999'
8 3 Male MA 41 '$50,000 - $74,999'
9 3 Male KY 32 '$35,000 - $49,999'
10 2 Female CA 23 '$35,000 - $49,999'
11 3 Male KY 25 '$50,000 - $74,999'
12 3 Female MA 21 '$75,000 - $99,999'", header = T, row.names = 1)
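The table above has one column per gender/state/income combination. If one column per individual level is wanted instead, as in the question's "Expanded Version", something along these lines might work. This is only a sketch, not part of the original answer: it assumes a tidyr version that provides pivot_longer and pivot_wider (1.1.0 or later for a scalar values_fill), and the names counts, mean_age_by_group and expanded are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same dummy data as in the sample above, repeated so the block runs on its own
df <- read.table(text =
"group gender state age income
1 3 Female CA 33 '$75,000 - $99,999'
2 3 Male MA 41 '$50,000 - $74,999'
3 3 Male KY 32 '$35,000 - $49,999'
4 2 Female CA 23 '$35,000 - $49,999'
5 3 Male KY 25 '$50,000 - $74,999'
6 3 Male MA 21 '$75,000 - $99,999'
7 3 Female CA 33 '$75,000 - $99,999'
8 3 Male MA 41 '$50,000 - $74,999'
9 3 Male KY 32 '$35,000 - $49,999'
10 2 Female CA 23 '$35,000 - $49,999'
11 3 Male KY 25 '$50,000 - $74,999'
12 3 Female MA 21 '$75,000 - $99,999'",
  header = TRUE, row.names = 1, stringsAsFactors = FALSE)

# Mean age per group, computed from the raw rows
mean_age_by_group <- df %>%
    group_by(group) %>%
    summarize(mean_age = mean(age))

# Stack gender, state and income into one "level" column,
# count each level within each group, then spread the levels
# into columns, filling absent combinations with 0
counts <- df %>%
    pivot_longer(c(gender, state, income),
                 names_to = "variable", values_to = "level") %>%
    count(group, level) %>%
    pivot_wider(names_from = level, values_from = n, values_fill = 0)

# One row per group: level counts plus the group's mean age
expanded <- left_join(counts, mean_age_by_group, by = "group")
as.data.frame(expanded)
```

Stacking the three columns only works cleanly here because no level name is shared between gender, state and income; if that could happen, the variable name would have to be kept as part of the column name. With the dummy data, group 2 ends up with a mean age of 23 and group 3 with 30.4.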