Problem description
I love how easy dplyr and tidyr have made it to create a single summary table with multiple predictor and outcome variables. One thing that stumped me was the final step: preserving or defining the order of the predictor variables, and of their factor levels, in the output table.
I've come up with a solution of sorts (below), which uses mutate to manually build a factor variable that combines the predictor and the predictor value (e.g. "gender_female"), with levels in the desired output order. But my solution is long-winded when there are many variables, and I wonder whether there is a better way?
library(dplyr)
library(tidyr)

levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

set.seed(1234)
dat <- data.frame(
  gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

dat %>%
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  # Statement below creates variable for ordering output
  mutate(
    pred_ord = factor(interaction(predictor, addNA(pred_value), sep = "_"),
                      levels = c(paste("gender", levels(addNA(dat$gender)), sep = "_"),
                                 paste("ethnicity", levels(addNA(dat$ethnicity)), sep = "_")))
  ) %>%
  group_by(pred_ord, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key = outcome, value = n) %>%
  separate(pred_ord, c("Predictor", "Pred_value"))
Source: local data frame [9 x 4]
  Predictor Pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1    gender     Female       25       27
2    gender       Male       11       10
3    gender    Unknown       12       15
4 ethnicity      Maori       10        9
5 ethnicity    Pacific        7        7
6 ethnicity      Asian        6       12
7 ethnicity      Other       10        9
8 ethnicity   European        5        4
9 ethnicity    Unknown       10       11
Warning message:
attributes are not identical across measure variables; they will be dropped
The table above is correct in that neither the predictors nor the predictor values have been re-sorted alphabetically.
Edit
As requested, this is what is produced if the default (alphabetical) ordering is used. It makes sense: when the factors are combined they are converted to a character variable and all attributes are dropped.
dat %>%
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n)
Source: local data frame [9 x 4]
  predictor pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1 ethnicity      Asian        6       12
2 ethnicity   European        5        4
3 ethnicity      Maori       10        9
4 ethnicity      Other       10        9
5 ethnicity    Pacific        7        7
6 ethnicity    Unknown       10       11
7    gender     Female       25       27
8    gender       Male       11       10
9    gender    Unknown       12       15
Warning message:
attributes are not identical across measure variables; they will be dropped
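To make the coercion behind that warning concrete, here is a minimal sketch on a hypothetical toy data frame (the names toy, size and shade are made up for illustration and are not part of the question's data). It assumes tidyr is installed and only shows what gather returns for two factors whose level sets differ, with and without factor_key.

```r
library(tidyr)

# Hypothetical toy data: two factor columns whose levels are deliberately
# NOT in alphabetical order.
toy <- data.frame(
  size  = factor(c("small", "large"), levels = c("small", "large")),
  shade = factor(c("light", "dark"),  levels = c("light", "dark"))
)

# gather() stacks both factors into one value column. Because the two level
# sets differ, the value column comes back as character. suppressWarnings()
# only silences the "attributes are not identical" warning seen above.
long <- suppressWarnings(
  gather(toy, key = "predictor", value = "pred_value", size, shade)
)
print(class(long$predictor))
print(class(long$pred_value))

# With factor_key = TRUE the key column becomes a factor whose levels follow
# the original column order; the value column is still character.
long_fk <- suppressWarnings(
  gather(toy, key = "predictor", value = "pred_value", size, shade,
         factor_key = TRUE)
)
print(levels(long_fk$predictor))
print(class(long_fk$pred_value))
```

So the key column can keep its order through gather itself, while the order of the stacked values has to be restored separately.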
Recommended answer
If you want your data to be factors arranged as such, you'll need to convert them back to factors, as gather coerces to character (which it warns you about). You can use gather's factor_key parameter to take care of predictor, but you'll need to assemble levels for pred_value, as it now combines two factors from the original. Simplifying a bit:
library(tidyr)
library(dplyr)

dat %>%
  gather(key = predictor, value = pred_value, gender, ethnicity, factor_key = TRUE) %>%
  group_by(predictor, pred_value) %>%
  summarise_all(sum) %>%
  ungroup() %>%
  mutate(pred_value = factor(pred_value, levels = unique(c(levels_eth, levels_gnd),
                                                         fromLast = TRUE))) %>%
  arrange(predictor, pred_value)
## # A tibble: 9 × 4
##   predictor pred_value outcome1 outcome2
##      <fctr>     <fctr>    <int>    <int>
## 1    gender     Female       25       27
## 2    gender       Male       11       10
## 3    gender    Unknown       12       15
## 4 ethnicity      Maori       10        9
## 5 ethnicity    Pacific        7        7
## 6 ethnicity      Asian        6       12
## 7 ethnicity      Other       10        9
## 8 ethnicity   European        5        4
## 9 ethnicity    Unknown       10       11
Note that you'll need to use unique with fromLast = TRUE to arrange the duplicate "Unknown" values into a single occurrence in the right place; union will put it earlier.
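The difference between the two can be checked in base R alone. The sketch below reuses levels_eth and levels_gnd from the question; the other names (lv_unique, lv_union, gnd) are introduced here only for illustration.

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

# unique(..., fromLast = TRUE) keeps the LAST occurrence of each duplicate,
# so "Unknown" lands after "Female" and "Male".
lv_unique <- unique(c(levels_eth, levels_gnd), fromLast = TRUE)
print(lv_unique)

# union() keeps the FIRST occurrence, so "Unknown" stays ahead of
# "Female" and "Male".
lv_union <- union(levels_eth, levels_gnd)
print(lv_union)

# Sorting the gender values under each level set shows the effect on the
# row order that arrange() would produce.
gnd <- c("Unknown", "Male", "Female")
sorted_unique <- as.character(sort(factor(gnd, levels = lv_unique)))
sorted_union  <- as.character(sort(factor(gnd, levels = lv_union)))
print(sorted_unique)  # Female, Male, Unknown
print(sorted_union)   # Unknown, Female, Male
```

This works here because "Unknown" is the last level of both factors. If a shared level sat in different positions within each factor, a single combined level vector could not reproduce both orders, and a per-predictor ordering would be needed instead.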