How to categorize variable counts under multiple conditions in R
Problem description
I have the following DF:
ID     Var1  Var2      Type
IR-1   A1    X1,X2,X3  New
IR-2                   Old
IR-3   A2    X1,X4     New
IR-4   A1    X1,X2,X3  New
IR-4   A3    X1,X2,X3  New
IR-4   A2    X1,X2,X3  New
IR-5   A1    X1,X3     New
IR-5   A2    X1,X3     New
IR-5   A3    X1,X3     New
IR-6                   New
IR-7   A2    X1,X2,X3  New
IR-8         X1,X2,X3  New
IR-9   A2    X8        New
IR-10                  Old
Desired output
Variables   Excl_Count  %        A1  A2  A3  X1  X2  X3  X4  GT  XN  XP  X8  KP  KL
Total       10          100.00%  3   5   2   6   4   5   1   0   0   0   1   0   0
Blank_Var1  4           40.00%   0   0   0   1   1   1   0   0   0   0   0   0   0
Blank_Var2  3           30.00%   0   0   0   0   0   0   0   0   0   0   0   0   0
Blank_Both  3           30.00%   0   0   0   0   0   0   0   0   0   0   0   0   0
Blank_New   1           33.33%   0   0   0   0   0   0   0   0   0   0   0   0   0
Blank_Old   2           66.66%   0   0   0   0   0   0   0   0   0   0   0   0   0
Non_Blank   7           70.00%   3   5   2   6   4   5   1   0   0   0   1   0   0
Using DF, I want to understand the distribution of the Var1 and Var2 values combined, counted per unique ID.
Where,
- Total = count of unique ID and, across the row, the count of IDs that hit each variable value (i.e. Var1 and Var2)
- Excl_Count = applies if a particular ID has only 1 value as part of either Var1 or Var2
- Blank_Var1 = count of unique ID where Var1 is Null/NA/Blank or 0
- Blank_Var2 = count of unique ID where Var2 is Null/NA/Blank or 0
- Blank_Both = count of unique ID where both Var1 and Var2 are Null/NA/Blank or 0
- Blank_New = count of unique ID where both Var1 and Var2 are Null/NA/Blank or 0 and Type = New
- Blank_Old = count of unique ID where both Var1 and Var2 are Null/NA/Blank or 0 and Type = Old
- Non_Blank = count of unique ID where either Var1 or Var2 is not Null/NA/Blank or 0
- A1 to KL are the counts corresponding to each row.
Below is the code that I have tried, but it is not working as expected:
library(RMySQL)
library(dplyr)
library(tidyverse)

# Count Total
Total <- DF %>%
  dplyr::group_by(ID) %>%
  dplyr::mutate(count = n())

# Excl_Count
Excl_Count <- DF %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = n_distinct(ID))

# Blank_Var1
Blank_Var1 <- DF %>% dplyr::filter(Var1 == '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))

# Blank_Var2
Blank_Var2 <- DF %>% dplyr::filter(Var2 == '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))

# Blank_Both
Blank_Both <- DF %>% dplyr::filter(Var1 == '' & Var2 == '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))

# Blank_New
Blank_New <- DF %>% dplyr::filter(Var1 == '' & Type == 'New') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))

# Blank_Old
Blank_Old <- DF %>% dplyr::filter(Var1 == '' & Type == 'Old') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
Input
structure(list(ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4",
"IR-4", "IR-5", "IR-5", "IR-5", "IR-6", "IR-7", "IR-8", "IR-9",
"IR-10"), Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2",
"A3", "", "A2", "", "A2", ""), Var2 = c("X1,X2,X3", "", "X1,X4",
"X1,X2,X3", "X1,X2,X3", "X1,X2,X3", "X1,X3", "X1,X3", "X1,X3",
"", "X1,X2,X3", "X1,X2,X3", "X8", ""), Type = c("New", "Old",
"New", "New", "New", "New", "New", "New", "New", "New", "New",
"New", "New", "Old")), class = "data.frame", row.names = c(NA,
-14L))
Recommended answer
Create three intermediate objects (df1, df2 and df3) and proceed as follows.
#load libraries
library(tidyverse)
Revised data
df <- structure(list(ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4",
"IR-4", "IR-5", "IR-5", "IR-5", "IR-6", "IR-7", "IR-8", "IR-9",
"IR-10"), Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2",
"A3", "", "A2", "", "A2", ""), Var2 = c("", "", "X1,X4",
"X1,X2,X3", "X1,X2,X3", "X1,X2,X3", "X1,X3", "X1,X3", "X1,X3",
"", "X1,X2,X3", "X1,X2,X3", "X8", ""), Type = c("New", "Old",
"New", "New", "New", "New", "New", "New", "New", "New", "New",
"New", "New", "Old")), class = "data.frame", row.names = c(NA,
-14L))
> df
      ID Var1     Var2 Type
1   IR-1   A1           New
2   IR-2                Old
3   IR-3   A2    X1,X4  New
4   IR-4   A1 X1,X2,X3  New
5   IR-4   A2 X1,X2,X3  New
6   IR-4   A3 X1,X2,X3  New
7   IR-5   A1    X1,X3  New
8   IR-5   A2    X1,X3  New
9   IR-5   A3    X1,X3  New
10  IR-6                New
11  IR-7   A2 X1,X2,X3  New
12  IR-8      X1,X2,X3  New
13  IR-9   A2       X8  New
14 IR-10                Old
In the above revised data, I emptied Var2 in one row, for IR-1.
(This assumes a maximum of three flags in Var2; otherwise modify the arguments to separate accordingly.)
# First: counts per Var1 value
df1 <- df %>%
  group_by(Var1) %>%
  mutate(Total = n_distinct(ID),
         Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
         Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
         Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
         Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
         Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
         non_blank = Total - Blank_Both) %>%
  select(-c(ID, Var2, Type)) %>%
  filter(!(is.na(Var1) | Var1 == "" | Var1 == "0")) %>%
  pivot_longer(-Var1) %>%
  pivot_wider(id_cols = name, names_from = Var1, values_from = "value", values_fn = min) %>%
  ungroup()
# Check that Blank_var2 values aren't empty
# A tibble: 7 x 4
  name          A1    A2    A3
  <chr>      <int> <int> <int>
1 Total          3     5     2
2 Blank_var1     0     0     0
3 Blank_var2     1     0     0
4 Blank_Both     0     0     0
5 Blank_new      0     0     0
6 Blank_old      0     0     0
7 non_blank      3     5     2
# Second: counts per Var2 flag, after splitting Var2 into one flag per row
df2 <- df %>%
  separate(Var2, into = paste0("Var2", 1:3), sep = ",") %>%
  pivot_longer(cols = c(Var21, Var22, Var23), names_to = "name", values_to = "Var2") %>%
  select(-name) %>%
  filter(!(is.na(Var2) | Var2 == "")) %>%
  group_by(Var2) %>%
  mutate(Total = n_distinct(ID),
         Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
         Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
         Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
         Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
         Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
         non_blank = Total - Blank_Both) %>%
  select(-c(ID, Var1, Type)) %>%
  pivot_longer(-Var2) %>%
  pivot_wider(id_cols = name, names_from = Var2, values_from = "value", values_fn = min)
# Check that Blank_var1 isn't empty this time
# A tibble: 7 x 6
  name          X1    X4    X2    X3    X8
  <chr>      <int> <int> <int> <int> <int>
1 Total          5     1     3     4     1
2 Blank_var1     1     0     1     1     0
3 Blank_var2     0     0     0     0     0
4 Blank_Both     0     0     0     0     0
5 Blank_new      0     0     0     0     0
6 Blank_old      0     0     0     0     0
7 non_blank      5     1     3     4     1
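The df2 step above fixes the number of Var2 flags at three. As a hedged alternative, not part of the original answer, tidyr's separate_rows() puts each comma-separated flag on its own row without a preset maximum. The toy data frame below is made up for illustration only.

```r
library(dplyr)
library(tidyr)

# Toy data (illustrative only): Var2 holds a varying number of comma-separated flags
toy <- data.frame(
  ID   = c("IR-1", "IR-2", "IR-3"),
  Var2 = c("X1,X2,X3,X4", "X1", ""),
  stringsAsFactors = FALSE
)

# Put each flag on its own row, then drop blank entries
long <- toy %>%
  separate_rows(Var2, sep = ",") %>%
  filter(!(is.na(Var2) | Var2 == ""))

# Number of unique IDs per flag
flag_counts <- long %>%
  group_by(Var2) %>%
  summarise(Total = n_distinct(ID), .groups = "drop")

print(flag_counts)
```

If this fits your data, the separate() and pivot_longer() pair in df2 could be swapped for the single separate_rows() call, and the rest of the pipeline should be able to stay as it is.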
# Third: overall counts, without any grouping
df3 <- df %>%
  summarise(Total = n_distinct(ID),
            Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
            Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
            Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
            Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
            Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
            non_blank = Total - Blank_Both) %>%
  pivot_longer(cols = 1:7, names_to = "Variable", values_to = "Excl_count") %>%
  mutate(`%` = case_when(
    Variable == "Total" ~ "100.00%",
    Variable %in% c("Blank_var1", "Blank_var2", "Blank_Both", "non_blank") ~ paste0(round(Excl_count*100/Excl_count[Variable == "Total"], 2), "%"),
    Variable == "Blank_new" | Variable == "Blank_old" ~ paste0(round(Excl_count*100/Excl_count[Variable == "Blank_Both"], 2), "%")))
> df3
# A tibble: 7 x 3
  Variable   Excl_count `%`
  <chr>           <int> <chr>
1 Total              10 100.00%
2 Blank_var1          4 40%
3 Blank_var2          4 40%
4 Blank_Both          3 30%
5 Blank_new           1 33.33%
6 Blank_old           2 66.67%
7 non_blank           7 70%
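The % column above shows 40% where the desired output has 40.00%, because round() does not keep trailing zeros. One possible fix is to format with sprintf() instead. The helper name fmt_pct is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not in the original answer):
# format part/whole as a percentage with exactly two decimals
fmt_pct <- function(part, whole) {
  sprintf("%.2f%%", part * 100 / whole)
}

fmt_pct(4, 10)  # "40.00%"
fmt_pct(2, 3)   # "66.67%"
```

Inside the case_when() of df3, the paste0(round(...), "%") calls could be replaced with this helper if fixed two-decimal output is wanted.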
Lastly, merge all three to obtain this:
merge(df3, merge(df1, df2, by.x = "name", by.y = "name", sort = F),
by.x = "Variable", by.y = "name", sort = F)
    Variable Excl_count       % A1 A2 A3 X1 X4 X2 X3 X8
1      Total         10 100.00%  3  5  2  5  1  3  4  1
2 Blank_var1          4     40%  0  0  0  1  0  1  1  0
3 Blank_var2          4     40%  1  0  0  0  0  0  0  0
4 Blank_Both          3     30%  0  0  0  0  0  0  0  0
5  Blank_new          1  33.33%  0  0  0  0  0  0  0  0
6  Blank_old          2  66.67%  0  0  0  0  0  0  0  0
7  non_blank          7     70%  3  5  2  5  1  3  4  1
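The same combination can be sketched with dplyr's left_join() in place of merge(). The three small tibbles below (s3, s1, s2) are stand-ins for df3, df1 and df2, holding only a two-row excerpt of the values shown above. The names are assumptions made for this illustration.

```r
library(dplyr)

# Stand-ins for df3, df1 and df2 (two-row excerpt, for illustration only)
s3 <- tibble(Variable = c("Total", "Blank_var1"), Excl_count = c(10L, 4L))
s1 <- tibble(name = c("Total", "Blank_var1"), A1 = c(3L, 0L))
s2 <- tibble(name = c("Total", "Blank_var1"), X1 = c(5L, 1L))

# Join on the row label; left_join() keeps the row order of the left table
combined <- s3 %>%
  left_join(s1, by = c("Variable" = "name")) %>%
  left_join(s2, by = c("Variable" = "name"))

print(combined)
```

Unlike merge() with its default sort = TRUE, left_join() keeps the left table's row order, so no extra sort argument is needed.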
Explanation
- You have to mutate three times along similar lines
- Firstly with group_by on Var1
- Secondly with group_by on Var2, but only after separating the flags and pivoting them longer into one single column
- Lastly/thirdly without any grouping (hence I used summarise)
- Basically, the arguments within mutate/summarise are exactly the same in all three intermediate objects and were copied and pasted
- Lastly I used merge from base R (you can optionally use left_join)
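Since the same blank test is pasted into every mutate/summarise call, one way to shorten the code is to move it into a small helper. The sketch below uses a hypothetical is_blank() function, which is not part of the original answer, and reproduces the ungrouped df3 counts on the revised data.

```r
library(dplyr)

# Hypothetical helper: TRUE where a value is NA, empty, or "0"
is_blank <- function(x) is.na(x) | x == "" | x == "0"

# Revised data from the answer (Var2 emptied for IR-1)
df <- data.frame(
  ID   = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4", "IR-4", "IR-5",
           "IR-5", "IR-5", "IR-6", "IR-7", "IR-8", "IR-9", "IR-10"),
  Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1",
           "A2", "A3", "", "A2", "", "A2", ""),
  Var2 = c("", "", "X1,X4", "X1,X2,X3", "X1,X2,X3", "X1,X2,X3", "X1,X3",
           "X1,X3", "X1,X3", "", "X1,X2,X3", "X1,X2,X3", "X8", ""),
  Type = c("New", "Old", "New", "New", "New", "New", "New",
           "New", "New", "New", "New", "New", "New", "Old"),
  stringsAsFactors = FALSE
)

# Same counts as df3, with the blank test written once
overall <- df %>%
  summarise(
    Total      = n_distinct(ID),
    Blank_var1 = n_distinct(ID[is_blank(Var1)]),
    Blank_var2 = n_distinct(ID[is_blank(Var2)]),
    Blank_Both = n_distinct(ID[is_blank(Var1) & is_blank(Var2)]),
    Blank_new  = n_distinct(ID[is_blank(Var1) & is_blank(Var2) & Type == "New"]),
    Blank_old  = n_distinct(ID[is_blank(Var1) & is_blank(Var2) & Type == "Old"]),
    non_blank  = Total - Blank_Both
  )

print(overall)
```

The same helper should drop into the grouped df1 and df2 pipelines unchanged, because it only touches the column vectors passed to it.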