Using rle to group by runs when using dplyr

Problem description

In R, I want to summarize my data after grouping it by the runs of a variable x (that is, each group is a subset of rows in which consecutive x values are the same). For instance, consider the following data frame, where I want to compute the average y value within each run of x:

(dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
#   x y
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 4
# 5 2 5
# 6 1 6
# 7 2 7

In this example, the x variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.

It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:

tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
#   1   2   3   4
# 2.0 4.5 6.0 7.0
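
To make the grouping key in the tapply call easier to follow, here is a small base R sketch that builds the run id step by step. The variable names r and run_id are only illustrative and do not appear in the original question:

```r
x <- c(1, 1, 1, 2, 2, 1, 2)

# rle() returns the length and the value of each run of x
r <- rle(x)
r$lengths  # 3 2 1 1
r$values   # 1 2 1 2

# Repeat the index of each run as many times as the run is long,
# which yields one run id per row of the data
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id     # 1 1 1 2 2 3 4
```

Passing run_id as the second argument to tapply gives the same grouping as the one-liner above.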

I figured I would be able to pretty directly carry over this logic to dplyr, but my attempts so far have all ended in errors:

library(dplyr)
# First attempt
dat %>%
  group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'

# Attempt 2 -- maybe "with" is the problem?
dat %>%
  group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
  summarize(mean(y))
# Error: invalid subscript type 'closure'

For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:

dat %>%
  group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
  summarize(mean(y))
#     run mean(y)
#   (dbl)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0
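
As a quick sanity check outside dplyr, the following base R sketch compares the cumsum-based run id with the rle-based one on the example vector. The names id_rle and id_cumsum are illustrative only:

```r
x <- c(1, 1, 1, 2, 2, 1, 2)

# Run id built from rle(): one index per run, repeated by run length
id_rle <- with(rle(x), rep(seq_along(lengths), lengths))

# Run id built by hand: start at 1 and add 1 wherever x differs
# from the previous element
id_cumsum <- cumsum(c(1, head(x, -1) != tail(x, -1)))

all(id_rle == id_cumsum)  # TRUE
```

The two vectors hold the same ids, although id_rle is an integer vector and id_cumsum is a double vector, which is why the dplyr output above shows the run column as (dbl).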

What is causing my rle-based grouping code to fail in dplyr, and is there any solution that enables me to keep using rle when grouping by run id?

Recommended answer

One option seems to be the use of {} as in:

dat %>%
    group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
    summarize(mean(y))
#Source: local data frame [4 x 2]
#
#     yy mean(y)
#  (int)   (dbl)
#1     1     2.0
#2     2     4.5
#3     3     6.0
#4     4     7.0

It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
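
For reference, here is a minimal sketch of grouping by a ready-made run id function. It assumes that the data.table package is installed (for rleid) and, for the second variant, that dplyr 1.1.0 or later is in use, which provides consecutive_id. Neither assumption comes from the original answer, and the column names run and mean_y are illustrative:

```r
library(dplyr)

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# Variant 1: data.table::rleid() returns the run id directly
# (assumes data.table is installed)
res_rleid <- dat %>%
  group_by(run = data.table::rleid(x)) %>%
  summarize(mean_y = mean(y))

# Variant 2: consecutive_id() plays the same role within dplyr
# (assumes dplyr >= 1.1.0)
res_cid <- dat %>%
  group_by(run = consecutive_id(x)) %>%
  summarize(mean_y = mean(y))

res_rleid$mean_y  # 2.0 4.5 6.0 7.0
res_cid$mean_y    # 2.0 4.5 6.0 7.0
```

Both variants keep the grouping expression to a single function call, so no rle bookkeeping is needed inside group_by.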

I noticed that this problem occurs with a data.frame or tbl_df input, but not with a tbl_dt or data.table input:

dat %>%
    tbl_df %>%
    group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
    summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'

dat %>%
    tbl_dt %>%
    group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
    summarize(mean(y))
# Source: local data table [4 x 2]
#
#      yy mean(y)
#   (int)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0

I reported this as an issue on dplyr's GitHub page.
