Problem description
To follow up on two earlier rapply questions (here and here, from years ago), it seems rapply only works on simple classes (i.e., vector, matrix) and not on the multifaceted data.frame class.
In most cases, as demonstrated below, the rapply equivalent is a nested lapply (or its variant wrappers, vapply/sapply), where the number of nested calls matches the number of list levels. Below is my test scenario comparing nested lapply with rapply across vector, matrix, and dataframe types. Every case returns equal results except the dataframe case.
Question
Is there a use case in base R for rapply() to recursively run operations on a list of dataframes and return a list of dataframes, as it does for lists of vectors or matrices? If not, is this a bug, or should the ?rapply base R docs warn about it? Most tutorials do not show rapply examples with dataframes.
One dimension (character vector)
The code below shows that rapply is equivalent to nested lapply on simple character vectors when counting characters, and that rapply is appreciably faster:
library(microbenchmark)

# Lists of script file names by language
ScriptLists <- list(R = list.files(path="/path/to/Scripts", pattern="\\.R"),
                    Python = list.files(path="/path/to/Scripts", pattern="\\.py"),
                    SQL = list.files(path="/path/to/Scripts", pattern="\\.sql"),
                    PHP = list.files(path="/path/to/Scripts", pattern="\\.php"),
                    XSLT = list.files(path="/path/to/Scripts", pattern="\\.xsl"))

# Nested lapply/vapply: character count of each file name
microbenchmark(
  ScriptsLists1 <- lapply(ScriptLists, function(i){
    unname(vapply(i, function(x){
      nchar(x)
    }, numeric(1)))
  })
)
# Unit: microseconds
# min lq mean median uq max neval
# 384 408.782 524.1363 434.7675 678.016 886.377 100
# rapply: same character count in a single recursive call
microbenchmark(
  ScriptsLists2 <- rapply(ScriptLists, function(x){
    nchar(x)
  }, how="list")
)
# Unit: microseconds
# min lq mean median uq max neval
# 110.196 112.8425 131.6141 114.5265 123.91 352.722 100
all.equal(ScriptsLists1, ScriptsLists2)
# [1] TRUE
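
Since the benchmark above depends on a local scripts directory, the following is a minimal reproducible sketch of the same equivalence check. The file names are made up and simply stand in for the list.files() results; timing is left out.

```r
# Hypothetical file names standing in for list.files() results
script_lists <- list(
  R      = c("clean.R", "model.R"),
  Python = c("scrape.py"),
  SQL    = c("schema.sql", "views.sql", "load.sql")
)

# Nested approach: one lapply per list level, vapply over the leaf values
nested <- lapply(script_lists, function(files) {
  unname(vapply(files, nchar, integer(1)))
})

# rapply visits each non-list leaf (here a whole character vector) once
recursive <- rapply(script_lists, nchar, how = "list")

all.equal(nested, recursive)
```

Note that rapply hands each character vector to the function as a whole, so the function must be vectorized (as nchar is) for the two approaches to agree.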
Two-dimensional types (matrix vs. data frame)
The input dataframe below (pulled from the highest yearly reputation rankings of StackOverflow top users) is used to build a list of top users' dataframes by language tag (C#, Python, R, etc.).
df <- structure(list(user = structure(c(12L, 14L, 19L, 35L, 22L, 32L,
1L, 36L, 7L, 9L, 2L, 18L, 27L, 6L, 30L, 20L, 10L, 24L, 29L, 23L,
5L, 3L, 4L, 15L, 25L, 17L, 11L, 8L, 33L, 13L, 34L, 16L, 21L,
26L, 28L, 31L), .Label = c("akrun", "alecxe", "Alexey Mezenin",
"BalusC", "Barmar", "CommonsWare", "Darin Dimitrov", "dasblinkenlight",
"Eric Duminil", "Felix Kling", "Frank van Puffelen", "Gordon Linoff",
"Greg Hewgill", "Günter Zöchbauer", "GurV", "Hans Passant", "JB Nizet",
"Jean-François Fabre", "jezrael", "Jon Skeet", "Jonathan Leffler",
"Martijn Pieters", "Martin R", "matt", "Nina Scholz", "paxdiablo",
"piRSquared", "Pranav C Balan", "Psidom", "Quentin", "Suragch",
"T.J. Crowder", "Tim Biegeleisen", "unutbu", "VonC", "Wiktor Stribiżew"
), class = "factor"), link = structure(c(2L, 17L, 21L, 31L, 1L,
10L, 27L, 28L, 22L, 33L, 35L, 34L, 20L, 3L, 15L, 19L, 18L, 25L,
29L, 4L, 8L, 5L, 11L, 32L, 6L, 30L, 16L, 24L, 13L, 36L, 14L,
12L, 9L, 7L, 23L, 26L), .Label = c("http://www.stackoverflow.com//users/100297/martijn-pieters",
"http://www.stackoverflow.com//users/1144035/gordon-linoff",
"http://www.stackoverflow.com//users/115145/commonsware", "http://www.stackoverflow.com//users/1187415/martin-r",
"http://www.stackoverflow.com//users/1227923/alexey-mezenin",
"http://www.stackoverflow.com//users/1447675/nina-scholz", "http://www.stackoverflow.com//users/14860/paxdiablo",
"http://www.stackoverflow.com//users/1491895/barmar", "http://www.stackoverflow.com//users/15168/jonathan-leffler",
"http://www.stackoverflow.com//users/157247/t-j-crowder", "http://www.stackoverflow.com//users/157882/balusc",
"http://www.stackoverflow.com//users/17034/hans-passant", "http://www.stackoverflow.com//users/1863229/tim-biegeleisen",
"http://www.stackoverflow.com//users/190597/unutbu", "http://www.stackoverflow.com//users/19068/quentin",
"http://www.stackoverflow.com//users/209103/frank-van-puffelen",
"http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer",
"http://www.stackoverflow.com//users/218196/felix-kling", "http://www.stackoverflow.com//users/22656/jon-skeet",
"http://www.stackoverflow.com//users/2336654/pirsquared", "http://www.stackoverflow.com//users/2901002/jezrael",
"http://www.stackoverflow.com//users/29407/darin-dimitrov", "http://www.stackoverflow.com//users/3037257/pranav-c-balan",
"http://www.stackoverflow.com//users/335858/dasblinkenlight",
"http://www.stackoverflow.com//users/341994/matt", "http://www.stackoverflow.com//users/3681880/suragch",
"http://www.stackoverflow.com//users/3732271/akrun", "http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew",
"http://www.stackoverflow.com//users/4983450/psidom", "http://www.stackoverflow.com//users/571407/jb-nizet",
"http://www.stackoverflow.com//users/6309/vonc", "http://www.stackoverflow.com//users/6348498/gurv",
"http://www.stackoverflow.com//users/6419007/eric-duminil", "http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre",
"http://www.stackoverflow.com//users/771848/alecxe", "http://www.stackoverflow.com//users/893/greg-hewgill"
), class = "factor"), location = structure(c(17L, 15L, 8L, 12L,
10L, 26L, 1L, 28L, 23L, 1L, 17L, 25L, 6L, 29L, 26L, 19L, 24L,
1L, 5L, 13L, 4L, 2L, 3L, 1L, 7L, 20L, 21L, 27L, 22L, 11L, 1L,
16L, 9L, 1L, 18L, 14L), .Label = c("", "??????", "Amsterdam, Netherlands",
"Arlington, MA", "Atlanta, GA, United States", "Bellevue, WA, United States",
"Berlin, Deutschland", "Bratislava, Slovakia", "California, USA",
"Cambridge, United Kingdom", "Christchurch, New Zealand", "France",
"Germany", "Hohhot, China", "Linz, Austria", "Madison, WI", "New York, United States",
"Ramanthali, Kannur, Kerala, India", "Reading, United Kingdom",
"Saint-Etienne, France", "San Francisco, CA", "Singapore", "Sofia, Bulgaria",
"Sunnyvale, CA", "Toulouse, France", "United Kingdom", "United States",
"Warsaw, Poland", "Who Wants to Know?"), class = "factor"), year_rep = structure(c(36L,
35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L,
22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L,
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("3,580", "3,604",
"3,636", "3,649", "3,688", "3,735", "3,796", "3,814", "3,886",
"3,920", "3,923", "3,950", "4,016", "4,046", "4,142", "4,179",
"4,195", "4,236", "4,313", "4,324", "4,348", "4,464", "4,475",
"4,482", "4,526", "4,723", "4,854", "4,936", "4,948", "5,188",
"5,258", "5,337", "5,577", "5,740", "5,835", "5,985"), class = "factor"),
total_rep = structure(c(18L, 2L, 34L, 27L, 22L, 20L, 5L,
3L, 31L, 1L, 6L, 9L, 13L, 25L, 21L, 36L, 14L, 4L, 11L, 7L,
8L, 10L, 30L, 29L, 24L, 15L, 35L, 17L, 33L, 23L, 12L, 28L,
16L, 19L, 26L, 32L), .Label = c("12,557", "154,439", "158,134",
"220,515", "229,553", "233,368", "269,380", "289,989", "30,027",
"31,602", "36,950", "401,595", "41,183", "411,535", "418,780",
"455,157", "475,813", "499,408", "507,043", "508,310", "509,365",
"525,176", "529,137", "61,135", "616,135", "64,476", "651,397",
"672,118", "7,932", "703,046", "709,683", "71,032", "77,211",
"83,237", "86,520", "921,690"), class = "factor"), tag1 = structure(c(15L,
2L, 10L, 6L, 11L, 8L, 12L, 13L, 4L, 14L, 11L, 11L, 10L, 1L,
8L, 4L, 8L, 16L, 11L, 16L, 8L, 9L, 7L, 15L, 8L, 7L, 5L, 4L,
15L, 6L, 11L, 4L, 3L, 3L, 8L, 16L), .Label = c("android",
"angular2", "c", "c#", "firebase", "git", "java", "javascript",
"laravel", "pandas", "python", "r", "regex", "ruby", "sql",
"swift"), class = "factor"), tag2 = structure(c(23L, 24L,
19L, 8L, 20L, 14L, 6L, 13L, 3L, 21L, 22L, 20L, 19L, 12L,
10L, 12L, 14L, 11L, 17L, 11L, 18L, 18L, 15L, 16L, 2L, 9L,
7L, 12L, 16L, 19L, 17L, 1L, 4L, 5L, 14L, 11L), .Label = c(".net",
"arrays", "asp.net-mvc", "bash", "c++", "dplyr", "firebase-database",
"github", "hibernate", "html", "ios", "java", "javascript",
"jquery", "jsf", "mysql", "pandas", "php", "python", "python-3.x",
"ruby-on-rails", "selenium", "sql-server", "typescript"), class = "factor"),
tag3 = structure(c(20L, 17L, 11L, 12L, 24L, 15L, 11L, 8L,
5L, 4L, 23L, 24L, 11L, 3L, 10L, 1L, 6L, 31L, 25L, 28L, 18L,
19L, 26L, 27L, 22L, 16L, 2L, 9L, 15L, 13L, 21L, 30L, 29L,
7L, 14L, 2L), .Label = c(".net", "android", "android-intent",
"arrays", "asp.net-mvc-3", "asynchronous", "bash", "c#",
"c++", "css", "dataframe", "docker", "git-pull", "html",
"java", "java-8", "javascript", "jquery", "laravel-5.3",
"mysql", "numpy", "object", "protractor", "python-2.7", "r",
"servlets", "sql-server", "swift3", "unix", "winforms", "xcode"
), class = "factor")), .Names = c("user", "link", "location",
"year_rep", "total_rep", "tag1", "tag2", "tag3"), class = "data.frame", row.names = c(NA,
-36L))
R code
The methods below average the year_rep and total_rep (5th/6th) columns for either type, matrix or dataframe. Be sure to change the return statement in the setup block, swapping in the commented line for the type being tested. Notice that rapply() returns the same result as nested lapply for matrices, but not for dataframes.
# NESTED LIST SETUP ------------------------------------
LangLists <- list(`c#`=list(), python=list(), sql=list(), php=list(), r=list(),
                  java=list(), javascript=list(), ruby=list(), `c++`=list())

LangLists <- setNames(mapply(function(i, j){
  # Keep users with language j in any of their three tags
  df <- subset(df, tag1 == j | tag2 == j | tag3 == j)
  # Strip thousands separators and convert reputation columns to numeric
  df$year_rep <- as.numeric(as.character(gsub(",", "", df$year_rep)))
  df$total_rep <- as.numeric(as.character(gsub(",", "", df$total_rep)))
  return(list(as.matrix(df)))    # MATRIX TYPE
  # return(list(df))             # DF TYPE
}, LangLists, names(LangLists), SIMPLIFY=FALSE), names(LangLists))
# -----------------------------------------------------
# MATRIX RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    cbind(mean(as.numeric(df[,5])),
          mean(as.numeric(df[,6])))
  })
})

LangLists2 <- rapply(LangLists, function(i){
  cbind(mean(as.numeric(i[,5])),
        mean(as.numeric(i[,6])))
}, classes="matrix", how="list")

all.equal(LangLists1, LangLists2)
# [1] TRUE
# DATA FRAME RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    data.frame(year_rep=mean(df$year_rep),
               total_rep=mean(df$total_rep))
  })
})

LangLists2 <- rapply(LangLists, function(i){
  data.frame(year_rep=mean(i$year_rep),
             total_rep=mean(i$total_rep))
}, classes="data.frame", how="list")

all.equal(LangLists1, LangLists2)
# [1] "Component "c#": Component 1: Names: 2 string mismatches"
# [2] "Component "c#": Component 1: Attributes: < names for target but not for current >"
# [3] "Component "c#": Component 1: Attributes: < Length mismatch: comparison on first 0 components >"
# [4] "Component "c#": Component 1: Length mismatch: comparison on first 2 components"
# [5] "Component "c#": Component 1: Component 1: Modes: numeric, NULL"
...
In fact, whereas the nested lapply returns a list of intact two-column dataframes holding the rep means, the rapply call for dataframes converts the underlying dataframes to lists of NULLs. So again, why does rapply fail to return a list of dataframes when it succeeds for vectors and matrices?
# $`c#`
# $`c#`[[1]]
# $`c#`[[1]]$X
# NULL
# $`c#`[[1]]$user
# NULL
# $`c#`[[1]]$link
# NULL
# $`c#`[[1]]$location
# NULL
# $`c#`[[1]]$year_rep
# NULL
# $`c#`[[1]]$total_rep
# NULL
# $`c#`[[1]]$tag1
# NULL
# $`c#`[[1]]$tag2
# NULL
# $`c#`[[1]]$tag3
# NULL
# $python
# $python[[1]]
# $python[[1]]$X
# NULL
# $python[[1]]$user
# NULL
# $python[[1]]$link
# NULL
# $python[[1]]$location
# NULL
# $python[[1]]$year_rep
# NULL
# $python[[1]]$total_rep
# NULL
# $python[[1]]$tag1
# NULL
# $python[[1]]$tag2
# NULL
# $python[[1]]$tag3
# NULL
Recommended answer
It appears that rapply is not designed to process lists of data.frames.
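The Details section of ?rapply says that if how is "list" or "unlist", the list is copied, all non-list elements which have a class included in classes are replaced by the result of applying f to the element, and all others are replaced by deflt.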
Since data.frames are lists, they do not fall under the first category. Instead, rapply recurses into them, and each column, whose class is not "data.frame", falls into the "all others" catch-all and is replaced by deflt, whose default value is NULL. This explains the per-column NULLs in the final block of output in the question.
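
The behavior described above can be seen on a toy input. This is a small sketch with made-up values, not the question's data; the list and column names are chosen only to mirror the question.

```r
# Toy nested list holding one small data frame (values are made up)
toy <- list(r = list(data.frame(year_rep = c(10, 20), total_rep = c(100, 300))))

# classes = "data.frame" never matches: rapply recurses into the data frame
# and visits its columns, whose class is "numeric", so each becomes deflt (NULL)
by_df <- rapply(toy, function(d) nrow(d), classes = "data.frame", how = "list")
str(by_df)

# Matching on the column class shows that f is applied column by column
by_col <- rapply(toy, mean, classes = "numeric", how = "list")
str(by_col)
```

In the second call each column is averaged separately, which confirms that the leaves rapply sees are the columns rather than the data frame itself.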
The final alternative for the how argument is "replace", and it appears that data.frames are simply ignored under this mode: the documentation describes only elements that are not themselves lists and whose class is included in classes, which are replaced by the result of applying f. It makes no mention of elements which are themselves lists. Running the code above with how="replace" appears to return a nested list in which what were data.frames are now simple lists, so it appears that rapply went through and stripped the class attribute.
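
If a recursive apply that treats data.frames as leaves is needed, one option is a small hand-written helper. The sketch below is an assumption for illustration, not part of base R: rapply_df and rep_means are hypothetical names, and the toy values are made up.

```r
# Hypothetical helper (not part of base R): recurse through plain lists,
# but stop at data frames and apply f to each one as a whole
rapply_df <- function(x, f, ...) {
  if (is.data.frame(x)) {
    f(x, ...)
  } else if (is.list(x)) {
    lapply(x, rapply_df, f = f, ...)
  } else {
    x
  }
}

# Toy input with made-up values, mirroring the question's nesting
toy <- list(
  r      = list(data.frame(year_rep = c(10, 20), total_rep = c(100, 300))),
  python = list(data.frame(year_rep = c(5, 7),   total_rep = c(50, 70)))
)

# Per-data-frame summary, as in the question's DATA FRAME RETURN block
rep_means <- function(d) {
  data.frame(year_rep = mean(d$year_rep), total_rep = mean(d$total_rep))
}

via_helper <- rapply_df(toy, rep_means)
via_lapply <- lapply(toy, function(i) lapply(i, rep_means))

all.equal(via_helper, via_lapply)
```

The is.data.frame() check must come before is.list(), because a data.frame satisfies both and would otherwise be recursed into column by column.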