How to count rows matching a condition after grouping a data.table


Question

I have the following data frame:

library(readr)
dat <- read_csv(
  "s1,s2,v1,v2
   a,b,10,20
   a,b,22,NA
   a,b,13,33
   c,d,3,NA
   c,d,4.5,NA
   c,d,10,20"
)

dat
#> # A tibble: 6 x 4
#>      s1    s2    v1    v2
#>   <chr> <chr> <dbl> <int>
#> 1     a     b  10.0    20
#> 2     a     b  22.0    NA
#> 3     a     b  13.0    33
#> 4     c     d   3.0    NA
#> 5     c     d   4.5    NA
#> 6     c     d  10.0    20

What I want to do is:

Filter rows based on v1 values
Group by s1 and s2
Count the total rows in every group
Count the rows in every group where v2 is not NA

For example, with v1_filter >= 0 we get:

s1 s2 total_line non_na_line
a  b     3          2
c  d     3          1

And with v1_filter >= 10 we get:

s1 s2 total_line non_na_line
a  b     3          2
c  d     1          1

How can I achieve that with data.table or dplyr? In reality, dat has around 31 million rows, so we need a fast method.

I'm stuck with this:

 library(data.table)
 dat <- data.table(dat)

 v1_filter = 0
 dat[, v1 >= v1_filter,
     by=list(s1,s2)]

Recommended answer

> library(readr)
> dat <- read_csv(
+   "s1,s2,v1,v2
+    a,b,10,20
+    a,b,22,NA
+    a,b,13,33
+    c,d,3,NA
+    c,d,4.5,NA
+    c,d,10,20"
+ )
>
> dat
# A tibble: 6 x 4
     s1    s2    v1    v2
  <chr> <chr> <dbl> <int>
1     a     b  10.0    20
2     a     b  22.0    NA
3     a     b  13.0    33
4     c     d   3.0    NA
5     c     d   4.5    NA
6     c     d  10.0    20

Use data.table, since your data is large:

> library(data.table)
data.table 1.10.4
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> dat=data.table(dat)

Without removing NA rows, with the v1 filter set to 0.1:

> dat1=dat[v1>0.1,.N,.(s1,s2)]
> dat1
   s1 s2 N
1:  a  b 3
2:  c  d 3

Removing rows where v2 is NA, with the v1 filter set to 0.1:

> dat2=dat[v1>0.1&!is.na(v2),.N,.(s1,s2)]
> dat2
   s1 s2 N
1:  a  b 2
2:  c  d 1

Merging the two, with the v1 filter set to 0 (N is the total row count, i.N is the count of rows where v2 is not NA):

 > dat[v1 > 0, .N, by = .(s1, s2)][ dat[v1 > 0 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0 ]
       s1 s2 N i.N
    1:  a  b 3   2
    2:  c  d 3   1
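
As an alternative to joining two grouped results, both counts can be computed in a single grouped pass, which may matter with around 31 million rows. The following is a minimal sketch, not the answerer's method: it filters on v1 in `i`, then computes `.N` and `sum(!is.na(v2))` together in `j`. The helper name `count_by_group` is hypothetical and only used for illustration; the column names `total_line` and `non_na_line` are taken from the expected output in the question.

```r
library(data.table)

# Same toy data as in the question, built directly as a data.table.
dat <- data.table(
  s1 = c("a", "a", "a", "c", "c", "c"),
  s2 = c("b", "b", "b", "d", "d", "d"),
  v1 = c(10, 22, 13, 3, 4.5, 10),
  v2 = c(20L, NA, 33L, NA, NA, 20L)
)

# count_by_group() is a hypothetical helper, not part of data.table.
# i:  keep rows with v1 >= v1_filter
# j:  .N is the number of rows in the group,
#     sum(!is.na(v2)) counts rows of the group where v2 is not NA
# by: group on s1 and s2
count_by_group <- function(dt, v1_filter) {
  dt[v1 >= v1_filter,
     .(total_line = .N,
       non_na_line = sum(!is.na(v2))),
     by = .(s1, s2)]
}

res0 <- count_by_group(dat, 0)
res10 <- count_by_group(dat, 10)
print(res0)
print(res10)
```

With `v1_filter = 0` this gives total_line 3 and non_na_line 2 for group a/b, and 3 and 1 for group c/d, matching the first expected table in the question. Note that a group with no rows left after the filter does not appear in the result at all.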
