Question
I have a very simple question about referencing data columns within a nested dataframe.
For a reproducible example, I'll nest mtcars by the two values of variable am:
library(tidyverse)
mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()
mtcars_nested
This gives the following data:
#> # A tibble: 2 x 2
#> # Groups: am [2]
#> am data
#> <dbl> <list>
#> 1 1 <tibble [13 × 10]>
#> 2 0 <tibble [19 × 10]>
If I now wanted to use purrr::map to take the mean of mpg for each level of am, I wonder why this doesn't work:
take_mean_mpg <- function(df){
  mean(df[["data"]]$mpg)
}
map(mtcars_nested, take_mean_mpg)
Error in df[["data"]] : subscript out of bounds
Or maybe a simpler question is: how should I properly reference the mpg column once it's nested? I know that this doesn't work:
mtcars_nested[["data"]]$mpg
Recommended answer
Data frames (and tbls) are lists of columns, not lists of rows, so when you pass the whole tbl mtcars_nested to map(), it iterates over the columns, not over the rows. You can use mutate with your function, and map_dbl so that your new column is not a list column.
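library(tidyverse)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()
mtcars_nested

# Takes one nested tibble (one element of the data list column)
take_mean_mpg <- function(df){
  mean(df$mpg)
}

# map_dbl() iterates over the data list column and returns a numeric vector
mtcars_nested %>%
  mutate(mean_mpg = map_dbl(.data[["data"]], take_mean_mpg))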
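The point that map() walks over columns rather than rows can be checked on a small example. This is only a sketch: the data frame df below is hypothetical and not part of the question.

```r
library(purrr)

# A small hypothetical data frame, used only for illustration
df <- data.frame(x = 1:3, y = c(10, 20, 30))

# map_dbl() visits each column in turn, so the result has one
# element per column (x and y), not one per row
col_means <- map_dbl(df, mean)
col_means

names(col_means)
length(col_means)
```

By the same logic, map(mtcars_nested, take_mean_mpg) in the question hands the function the am column first, which is a plain numeric vector with no element named "data". That is most likely where the "subscript out of bounds" error comes from.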
The .data[["data"]] argument to map_dbl() gives it the data list column from your dataframe to iterate over, rather than the entire dataframe. The .data part of the argument has no relation to your column named "data"; it is the rlang pronoun .data, which refers to your whole dataframe. [["data"]] then retrieves the column named "data" from your dataframe. You use mutate because you are trying (I assumed, perhaps incorrectly) to add a column with the averages to the nested dataframe. mutate() is used to add columns, so you add a column equal to the output of map() (or map_dbl()) with your function, which will return the list (or vector) of averages.
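Because the .data pronoun and the column called data are easy to confuse, it may help to compare the two spellings side by side. The sketch below assumes that a bare column name inside mutate() resolves to the same list column; the names by_name and by_pronoun are made up for this comparison.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

take_mean_mpg <- function(df) mean(df$mpg)

# Bare column name: inside mutate(), `data` refers to the list column
by_name <- mtcars_nested %>%
  mutate(mean_mpg = map_dbl(data, take_mean_mpg))

# .data pronoun: the same column, looked up by its name as a string
by_pronoun <- mtcars_nested %>%
  mutate(mean_mpg = map_dbl(.data[["data"]], take_mean_mpg))

# Both spellings should produce the same means
all.equal(by_name$mean_mpg, by_pronoun$mean_mpg)
```

The pronoun form is mostly useful when the column name is held in a string or when a bare name could be ambiguous; for interactive work the bare name is usually enough.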
This can be a confusing concept. Although map() is often used to iterate over the rows of a dataframe, it technically iterates over a list (see the documentation for its arguments). It also returns a list or a vector. The good news is that columns are just lists of values, so you pass it the list (column) you want it to iterate over and assign the result to the list (column) where you want it stored (this assignment happens with mutate()).
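Since the list column is just a list, it can also be inspected outside mutate(), which speaks to the original question of how to reach mpg once it is nested. The following is a minimal sketch; first_mpg and means are names chosen here for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

# The list column is a list holding one tibble per row of the nested tbl
is.list(mtcars_nested$data)
length(mtcars_nested$data)

# [[ ]] picks out a single tibble, and only then can $mpg be applied
first_mpg <- mtcars_nested$data[[1]]$mpg

# Iterating over the list column directly gives a plain numeric vector
means <- map_dbl(mtcars_nested$data, ~ mean(.x$mpg))
means
```

This also suggests why mtcars_nested[["data"]]$mpg from the question does not give the column: there, $mpg is applied to the list of tibbles rather than to any one tibble inside it.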