r - dplyr summarise results depend on how the output variables are named

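I am using the dplyr package (dplyr 0.4.3; R 3.2.3) for basic summaries of grouped data (summarise), but I get inconsistent results (NaN for "sd" and a wrong count for "n"). Changing the names of the output columns changes the outcome (examples below).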

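Summary of findings so far:

The plyr package is not loaded; I know it can cause dplyr problems if it is loaded first.
The results are the same with or without NA values in the data (not shown).
The problem can be avoided by using camelCase names (not shown) or output names that contain no non-alphanumeric separators.
Depending on how "." or "_" is combined in the output column names, valid results can still be obtained.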

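Question: Although I can work around this, am I violating some basic variable-naming rule, or is there a bug in the package that needs fixing? I have seen other questions about inconsistent behaviour, but none quite like this.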

Thanks, Matt

Example data:

library(dplyr)
df<-data_frame(id=c(1,1,1,2,2,2,3,3,3),
       time=rep(1:3, 3),
       glucose=c(90,150, 200,
                 100,150,200,
                 80,100,150))

Example: sd gives NaN and n is wrong

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose.sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

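I wondered whether the problem was the "." in the name, or reusing a name that already exists in the data frame.
Removing the existing df column name from the output fixes the problem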

df %>% group_by(time) %>%
  summarise(avg=mean(glucose, na.rm=TRUE),
        stdv=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time      avg     stdv     n
  (int)    (dbl)    (dbl) (int)
1     1  90.0000 10.00000     3
2     2 133.3333 28.86751     3
3     3 183.3333 28.86751     3

Dropping the "glucose" summary also fixes it, even when "glucose.sd" is kept.
Example: results are correct once "glucose" is removed

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose.sd     n
  (int)      (dbl) (int)
1     1   10.00000     3
2     2   28.86751     3
3     3   28.86751     3

If I name the first summary "glucose.mean" instead, it works fine

df %>% group_by(time) %>%
  summarise(glucose.mean=mean(glucose, na.rm=TRUE),
            glucose.sd=sd(glucose, na.rm=TRUE),
            n=sum(!is.na(glucose)))

   time glucose.mean glucose.sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

The same error occurs with a variable name that has no ".",
so the problem is not just the "." in the name

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose_sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

Renaming "glucose" to "glucose_mean" fixes it

df %>% group_by(time) %>%
  summarise(glucose_mean=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose_mean glucose_sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

Best answer

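The transformations you specify in summarise() are performed in the order in which they appear, which means that if you change the value of a variable, the new value is what the subsequent columns see (this differs from the base function transform()). When you do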

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

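the glucose=mean(glucose, na.rm=TRUE) part changes the value of the glucose variable, so by the time glucose.sd=sd(glucose, na.rm=TRUE) is evaluated, sd() no longer sees the original glucose values; it sees the single mean computed from them. The standard deviation of one value is undefined, and n counts that one value, which is why you get NaN and 1. If you reorder the columns, it works.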

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)),
        glucose=mean(glucose, na.rm=TRUE))
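
To make the overwrite concrete, the per-group step can be mimicked in base R for the time == 1 group of the example data. This is only a rough sketch of the sequential evaluation, not dplyr's internals, and the variable names are chosen for illustration. Note that base R's sd() returns NA for a single value, whereas the dplyr 0.4.3 output in the question prints NaN.

```r
# Glucose values for the time == 1 group in the example data
glucose <- c(90, 100, 80)

# Step 1: the summary named `glucose` replaces the original vector
glucose <- mean(glucose, na.rm = TRUE)    # now a single number

# Step 2: later expressions only see that single number
glucose_sd <- sd(glucose, na.rm = TRUE)   # sd of one value is missing
n <- sum(!is.na(glucose))                 # counts one value, not three

print(c(mean = glucose, sd = glucose_sd, n = n))
```

Computing the standard deviation and the count before reassigning glucose, as in the reordered call, leaves the original three values in place for those two expressions.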

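If you are wondering why this is the default behaviour, it is because creating a column and then using its value later in the same transformation is often convenient. For example, with mutate()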

df %>% group_by(time) %>%
  mutate(glucose_sq = glucose^2,
        glucose_sq_plus2 = glucose_sq+2)
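
For contrast, here is a small base R sketch of how transform() behaves, since the answer notes that it differs. The data frame df_small and the column names are hypothetical and used only for illustration. transform() evaluates all of its expressions against the original columns, so a reassigned column is not visible to later expressions, and a column created in the same call cannot be referenced at all.

```r
# A tiny data frame used only for this illustration
df_small <- data.frame(glucose = c(90, 150, 200))

# Both expressions are evaluated against the ORIGINAL glucose column,
# so `doubled` ignores the reassignment of glucose in the same call
out <- transform(df_small,
                 glucose = glucose / 10,
                 doubled = glucose * 2)
print(out)

# Referring to a column created in the same call fails, because
# glucose_sq does not exist yet when the expressions are evaluated
# (assuming no object named glucose_sq exists in the calling environment)
res <- try(transform(df_small,
                     glucose_sq = glucose^2,
                     glucose_sq_plus2 = glucose_sq + 2),
           silent = TRUE)
print(inherits(res, "try-error"))
```

With mutate(), the second call pattern works because each new column is available to the expressions that follow it.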