How to categorize variable counts under multiple conditions in R

Problem description

I have the following DF:

DF

ID       Var1         Var2              Type
IR-1     A1           X1,X2,X3          New
IR-2                                    Old
IR-3     A2           X1,X4             New
IR-4     A1           X1,X2,X3          New
IR-4     A3           X1,X2,X3          New
IR-4     A2           X1,X2,X3          New
IR-5     A1           X1,X3             New
IR-5     A2           X1,X3             New
IR-5     A3           X1,X3             New
IR-6                                    New
IR-7     A2           X1,X2,X3          New
IR-8                  X1,X2,X3          New
IR-9     A2           X8                New
IR-10                                   Old

Required output

Variables   Excl_Count   %         A1   A2   A3   X1   X2   X3   X4   GT   XN   XP   X8   KP   KL
Total       10           100.00%   3    5    2    6    4    5    1    0    0    0    1    0    0
Blank_Var1  4             40.00%   0    0    0    1    1    1    0    0    0    0    0    0    0
Blank_Var2  3             30.00%   0    0    0    0    0    0    0    0    0    0    0    0    0
Blank_Both  3             30.00%   0    0    0    0    0    0    0    0    0    0    0    0    0
Blank_New   1             33.33%   0    0    0    0    0    0    0    0    0    0    0    0    0
Blank_Old   2             66.66%   0    0    0    0    0    0    0    0    0    0    0    0    0
Non_Blank   7             70.00%   3    5    2    6    4    5    1    0    0    0    1    0    0

Using DF, I want to understand how unique IDs are distributed across the values of Var1 and Var2 combined.

Where:

  • Total = count of unique IDs; across the row, the count of unique IDs that hit each variable value (i.e. the values of Var1 and Var2)
  • Excl_Count = if a particular ID has only 1 value as part of either Var1 or Var2
  • Blank_Var1 = count of unique IDs where Var1 is Null/NA/Blank or 0
  • Blank_Var2 = count of unique IDs where Var2 is Null/NA/Blank or 0
  • Blank_Both = count of unique IDs where both Var1 and Var2 are Null/NA/Blank or 0
  • Blank_New = count of unique IDs where both Var1 and Var2 are Null/NA/Blank or 0 and Type = New
  • Blank_Old = count of unique IDs where both Var1 and Var2 are Null/NA/Blank or 0 and Type = Old
  • Non_Blank = count of unique IDs where either Var1 or Var2 is not Null/NA/Blank or 0
  • A1 to KL are the counts corresponding to each row.

Below is the code that I have tried, but it is not working as expected:

library(RMySQL)
library(dplyr)
library(tidyverse)

# Count Total
Total <- DF %>%
  dplyr::group_by(ID) %>%
  dplyr::mutate(count = n())
# Excl_Count
Excl_Count  <- DF %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = n_distinct(ID))
# Blank_Var1
Blank_Var1 <- DF %>% dplyr::filter(Var1 ==  '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Var2
Blank_Var2 <- DF %>% dplyr::filter(Var2 ==  '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Both
Blank_Both <- DF %>% dplyr::filter(Var1 ==  '' & Var2 == '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_New
Blank_New <- DF %>% dplyr::filter(Var1 ==  '' & Type == 'New') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Old
Blank_Old <- DF %>% dplyr::filter(Var1 ==  '' & Type == 'Old') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))

Input

structure(list(ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4",
"IR-4", "IR-5", "IR-5", "IR-5", "IR-6", "IR-7", "IR-8", "IR-9",
"IR-10"), Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2",
"A3", "", "A2", "", "A2", ""), Var2 = c("X1,X2,X3", "", "X1,X4",
"X1,X2,X3", "X1,X2,X3", "X1,X2,X3", "X1,X3", "X1,X3", "X1,X3",
"", "X1,X2,X3", "X1,X2,X3", "X8", ""), Type = c("New", "Old",
"New", "New", "New", "New", "New", "New", "New", "New", "New",
"New", "New", "Old")), class = "data.frame", row.names = c(NA,
-14L))

Recommended answer

Create three intermediate objects (df1, df2 and df3) and proceed as follows.

#load libraries
library(tidyverse)

Revised data

df <- structure(list(
  ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4", "IR-4", "IR-5",
         "IR-5", "IR-5", "IR-6", "IR-7", "IR-8", "IR-9", "IR-10"),
  Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2", "A3", "",
           "A2", "", "A2", ""),
  Var2 = c("", "", "X1,X4", "X1,X2,X3", "X1,X2,X3", "X1,X2,X3",
           "X1,X3", "X1,X3", "X1,X3", "", "X1,X2,X3", "X1,X2,X3",
           "X8", ""),
  Type = c("New", "Old", "New", "New", "New", "New", "New", "New",
           "New", "New", "New", "New", "New", "Old")),
  class = "data.frame", row.names = c(NA, -14L))

> df
      ID Var1     Var2 Type
1   IR-1   A1           New
2   IR-2                Old
3   IR-3   A2    X1,X4  New
4   IR-4   A1 X1,X2,X3  New
5   IR-4   A2 X1,X2,X3  New
6   IR-4   A3 X1,X2,X3  New
7   IR-5   A1    X1,X3  New
8   IR-5   A2    X1,X3  New
9   IR-5   A3    X1,X3  New
10  IR-6                New
11  IR-7   A2 X1,X2,X3  New
12  IR-8      X1,X2,X3  New
13  IR-9   A2       X8  New
14 IR-10                Old

In the revised data above, I emptied Var2 in the row for IR-1.

(This assumes a maximum of three flags in Var2; otherwise, modify the arguments to separate accordingly.)

df1 <- df %>%
  group_by(Var1) %>%
  mutate(Total = n_distinct(ID),
         Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
         Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
         Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
         Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
         Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
         non_blank = Total - Blank_Both) %>%
  select(-c(ID, Var2, Type)) %>%
  filter(!(is.na(Var1) | Var1 == "" | Var1 == "0")) %>%
  pivot_longer(-Var1) %>%
  pivot_wider(id_cols = name, names_from = Var1, values_from = "value", values_fn = min) %>%
  ungroup()

# Check that Blank_var2 values aren't empty
# A tibble: 7 x 4
  name          A1    A2    A3
  <chr>      <int> <int> <int>
1 Total          3     5     2
2 Blank_var1     0     0     0
3 Blank_var2     1     0     0
4 Blank_Both     0     0     0
5 Blank_new      0     0     0
6 Blank_old      0     0     0
7 non_blank      3     5     2

#Second

df2 <- df %>% separate(Var2, into = paste0("Var2", 1:3), sep = ",") %>%
  pivot_longer(cols = c(Var21, Var22, Var23), names_to = "name", values_to = "Var2") %>%
  select(-name) %>%
  filter(!(is.na(Var2) | Var2 == "")) %>%
  group_by(Var2) %>%
  mutate(Total = n_distinct(ID),
         Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
         Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
         Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
         Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
         Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
         non_blank = Total - Blank_Both) %>%
  select(-c(ID, Var1, Type)) %>%
  pivot_longer(-Var2) %>%
  pivot_wider(id_cols = name, names_from = Var2, values_from = "value", values_fn = min)

# Check that blank_var1 isn't empty this time
# A tibble: 7 x 6
  name          X1    X4    X2    X3    X8
  <chr>      <int> <int> <int> <int> <int>
1 Total          5     1     3     4     1
2 Blank_var1     1     0     1     1     0
3 Blank_var2     0     0     0     0     0
4 Blank_Both     0     0     0     0     0
5 Blank_new      0     0     0     0     0
6 Blank_old      0     0     0     0     0
7 non_blank      5     1     3     4     1
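
If the three-flag assumption is a concern, one possible alternative is tidyr's separate_rows(), which splits a delimited column into one row per value regardless of how many values a cell holds. The sketch below is not part of the original answer; the toy data frame and the object names (toy, long, flag_counts) are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Small illustrative data frame (a hypothetical subset, not the full data)
toy <- data.frame(
  ID   = c("IR-3", "IR-7", "IR-8", "IR-9", "IR-6"),
  Var1 = c("A2", "A2", "", "A2", ""),
  Var2 = c("X1,X4", "X1,X2,X3", "X1,X2,X3", "X8", ""),
  stringsAsFactors = FALSE
)

# separate_rows() creates one row per comma-separated flag,
# however many flags a cell contains; blank results are then dropped
long <- toy %>%
  separate_rows(Var2, sep = ",") %>%
  filter(!(is.na(Var2) | Var2 == ""))

# Number of unique IDs per Var2 flag
flag_counts <- long %>%
  group_by(Var2) %>%
  summarise(Total = n_distinct(ID), .groups = "drop")
```

This would replace the separate() plus pivot_longer() pair in the df2 step; the grouping and counting that follow stay the same.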

df3 <- df %>%
  summarise(Total = n_distinct(ID),
            Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
            Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
            Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
            Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
            Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
            non_blank = Total - Blank_Both) %>%
  pivot_longer(cols = 1:7, names_to = "Variable", values_to = "Excl_count") %>%
  mutate(`%` = case_when(Variable == "Total" ~ "100.00%",
                         Variable %in% c("Blank_var1", "Blank_var2", "Blank_Both", "non_blank") ~ paste0(round(Excl_count*100/Excl_count[Variable == "Total"], 2), "%"),
                         Variable == "Blank_new" | Variable == "Blank_old" ~ paste0(round(Excl_count*100/Excl_count[Variable == "Blank_Both"], 2), "%")))

> df3
# A tibble: 7 x 3
  Variable   Excl_count `%`
  <chr>           <int> <chr>
1 Total              10 100.00%
2 Blank_var1          4 40%
3 Blank_var2          4 40%
4 Blank_Both          3 30%
5 Blank_new           1 33.33%
6 Blank_old           2 66.67%
7 non_blank           7 70%
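
The percentages above print as "40%" rather than the "40.00%" shown in the required output, because round() drops trailing zeros. A minimal sketch of one way to force two decimals with base R's sprintf() follows; the vector excl_count simply copies counts from the df3 output, and the variable names are illustrative only.

```r
# Counts copied from the df3 output above (illustrative vector)
excl_count <- c(Total = 10, Blank_var1 = 4, Blank_var2 = 4,
                Blank_Both = 3, non_blank = 7)

# "%.2f" always prints two decimals; "%%" is a literal percent sign
pct <- sprintf("%.2f%%", 100 * excl_count / excl_count[["Total"]])
names(pct) <- names(excl_count)
```

Inside the case_when() call, the same sprintf() expression could stand in for each paste0(round(...), "%") term.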

Lastly, merge all three to obtain this:

merge(df3, merge(df1, df2, by.x = "name", by.y = "name", sort = FALSE),
      by.x = "Variable", by.y = "name", sort = FALSE)

    Variable Excl_count       % A1 A2 A3 X1 X4 X2 X3 X8
1      Total         10 100.00%  3  5  2  5  1  3  4  1
2 Blank_var1          4     40%  0  0  0  1  0  1  1  0
3 Blank_var2          4     40%  1  0  0  0  0  0  0  0
4 Blank_Both          3     30%  0  0  0  0  0  0  0  0
5  Blank_new          1  33.33%  0  0  0  0  0  0  0  0
6  Blank_old          2  66.67%  0  0  0  0  0  0  0  0
7  non_blank          7     70%  3  5  2  5  1  3  4  1
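
For readers who prefer to stay within dplyr, the same combination can be sketched with left_join(), which keeps the row order of the left-hand table. The three small data frames below are cut-down, hypothetical stand-ins for df3, df1 and df2 (first three rows, one value column each), not the real objects.

```r
library(dplyr)

# Cut-down stand-ins for df3, df1 and df2 (illustrative values only)
df3_small <- data.frame(Variable = c("Total", "Blank_var1", "Blank_var2"),
                        Excl_count = c(10L, 4L, 4L),
                        stringsAsFactors = FALSE)
df1_small <- data.frame(name = c("Total", "Blank_var1", "Blank_var2"),
                        A1 = c(3L, 0L, 1L),
                        stringsAsFactors = FALSE)
df2_small <- data.frame(name = c("Total", "Blank_var1", "Blank_var2"),
                        X1 = c(5L, 1L, 0L),
                        stringsAsFactors = FALSE)

# Join on differently named key columns: Variable (left) and name (right)
combined <- df3_small %>%
  left_join(df1_small, by = c("Variable" = "name")) %>%
  left_join(df2_small, by = c("Variable" = "name"))
```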

Explanation

  • You have to mutate three times along similar lines.
  • First, after group_by on Var1.
  • Second, after group_by on Var2, but only after separating its values and pivoting them longer into a single column.
  • Third, without any grouping (hence I used summarise).
  • The arguments within mutate/summarise are exactly the same in all three intermediate objects; they were simply copied and pasted.
  • Finally, I used merge from base R (you can optionally use left_join).
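
Since the same blank checks are copied into all three steps, they could be factored into small helpers. The sketch below is one hedged way to do that, not the answer's own code: is_blank() and count_blanks() are hypothetical names, the toy data is a made-up subset, and it uses summarise() throughout, whereas the answer uses mutate() followed by pivoting for the grouped cases.

```r
library(dplyr)

# Hypothetical helper: TRUE where a value is NA, empty, or "0"
is_blank <- function(x) is.na(x) | x == "" | x == "0"

# Hypothetical helper: the shared set of counts for any (grouped) data frame
count_blanks <- function(data) {
  data %>%
    summarise(
      Total      = n_distinct(ID),
      Blank_var1 = n_distinct(ID[is_blank(Var1)]),
      Blank_var2 = n_distinct(ID[is_blank(Var2)]),
      Blank_Both = n_distinct(ID[is_blank(Var1) & is_blank(Var2)]),
      Blank_new  = n_distinct(ID[is_blank(Var1) & is_blank(Var2) & Type == "New"]),
      Blank_old  = n_distinct(ID[is_blank(Var1) & is_blank(Var2) & Type == "Old"]),
      non_blank  = Total - Blank_Both
    )
}

# Small illustrative data frame (not the full data)
toy <- data.frame(
  ID   = c("IR-1", "IR-2", "IR-3", "IR-6", "IR-8"),
  Var1 = c("A1", "", "A2", "", ""),
  Var2 = c("", "", "X1,X4", "", "X1,X2,X3"),
  Type = c("New", "Old", "New", "New", "New"),
  stringsAsFactors = FALSE
)

# Ungrouped use, analogous to the df3 step
overall <- count_blanks(toy)
```

Passing a grouped data frame, for example toy %>% group_by(Var1) %>% count_blanks(), would give one row of counts per group, which could then be pivoted into the wide layout.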

08-22 12:41