Running rapply on a list of data frames

Problem description

To follow up on two rapply questions from years ago, here and here, it seems rapply only works on simple classes (i.e., vector, matrix) and not on the multifaceted data.frame class.

In most cases, as demonstrated below, the rapply equivalent is a nested lapply and its variant wrappers (v/sapply), where the number of nests corresponds to the number of list levels. Below is my testing scenario comparing nested lapply and rapply across vector, matrix, and data frame types. All types but data frames give equal results.
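
As a minimal sketch of the nesting claim above, the toy list below (hypothetical names `nested`, `g1`, `g2`, not from the original post) has two list levels above its numeric leaves, so the lapply version needs two nested calls while a single rapply call walks every level:

```r
# Hypothetical two-level list: group -> named numeric vectors
nested <- list(
  g1 = list(a = c(1, 2, 3), b = c(4, 5)),
  g2 = list(c = c(10, 20))
)

# One lapply per list level above the leaves
by_lapply <- lapply(nested, function(grp) lapply(grp, sum))

# rapply descends through all list levels on its own
by_rapply <- rapply(nested, sum, how = "list")

# Both approaches should produce the same nested structure
stopifnot(isTRUE(all.equal(by_lapply, by_rapply)))
```

Adding a further list level would require a third nested lapply, whereas the rapply call would stay unchanged.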

Question

Is there a use case in base R for rapply() to recursively run operations on a list of data frames and return a list of data frames, as it does for lists of vectors or matrices? If not, is this a bug, or should ?rapply in the base R docs warn about it? Most tutorials do not show rapply examples with data frames.

One-dimensional (character vector)

The code below shows that rapply is equivalent to nested lapply on simple character vectors when counting characters, and that rapply is appreciably faster in processing:

library(microbenchmark)

# File names grouped by extension (PHP and XSLT patterns matched to their own extensions)
ScriptLists <- list(R = list.files(path="/path/to/Scripts", pattern="\\.R"),
                    Python = list.files(path="/path/to/Scripts", pattern="\\.py"),
                    SQL = list.files(path="/path/to/Scripts", pattern="\\.sql"),
                    PHP = list.files(path="/path/to/Scripts", pattern="\\.php"),
                    XSLT = list.files(path="/path/to/Scripts", pattern="\\.xsl"))

microbenchmark(
  ScriptsLists1 <- lapply(ScriptLists, function(i){
    unname(vapply(i, function(x){ 
      nchar(x)
      }, numeric(1)))
    })
)
# Unit: microseconds
# min      lq     mean   median      uq     max neval
# 384 408.782 524.1363 434.7675 678.016 886.377   100

microbenchmark(
  ScriptsLists2 <- rapply(ScriptLists, function(x){
    nchar(x)
  }, how="list")
)
# Unit: microseconds
# min           lq     mean   median     uq     max neval
# 110.196 112.8425 131.6141 114.5265 123.91 352.722   100

all.equal(ScriptsLists1, ScriptsLists2)
# [1] TRUE

Two-dimensional types (matrix vs. data frame)

Input data frame (pulled from the highest yearly rankings of StackOverflow top users), used to build a list of top users' data frames by language tag (C#, Python, R, etc.).

df <- structure(list(user = structure(c(12L, 14L, 19L, 35L, 22L, 32L, 
1L, 36L, 7L, 9L, 2L, 18L, 27L, 6L, 30L, 20L, 10L, 24L, 29L, 23L, 
5L, 3L, 4L, 15L, 25L, 17L, 11L, 8L, 33L, 13L, 34L, 16L, 21L, 
26L, 28L, 31L), .Label = c("akrun", "alecxe", "Alexey Mezenin", 
"BalusC", "Barmar", "CommonsWare", "Darin Dimitrov", "dasblinkenlight", 
"Eric Duminil", "Felix Kling", "Frank van Puffelen", "Gordon Linoff", 
"Greg Hewgill", "Günter Zöchbauer", "GurV", "Hans Passant", "JB Nizet", 
"Jean-François Fabre", "jezrael", "Jon Skeet", "Jonathan Leffler", 
"Martijn Pieters", "Martin R", "matt", "Nina Scholz", "paxdiablo", 
"piRSquared", "Pranav C Balan", "Psidom", "Quentin", "Suragch", 
"T.J. Crowder", "Tim Biegeleisen", "unutbu", "VonC", "Wiktor Stribiżew"
), class = "factor"), link = structure(c(2L, 17L, 21L, 31L, 1L, 
10L, 27L, 28L, 22L, 33L, 35L, 34L, 20L, 3L, 15L, 19L, 18L, 25L, 
29L, 4L, 8L, 5L, 11L, 32L, 6L, 30L, 16L, 24L, 13L, 36L, 14L, 
12L, 9L, 7L, 23L, 26L), .Label = c("http://www.stackoverflow.com//users/100297/martijn-pieters", 
"http://www.stackoverflow.com//users/1144035/gordon-linoff", 
"http://www.stackoverflow.com//users/115145/commonsware", "http://www.stackoverflow.com//users/1187415/martin-r", 
"http://www.stackoverflow.com//users/1227923/alexey-mezenin", 
"http://www.stackoverflow.com//users/1447675/nina-scholz", "http://www.stackoverflow.com//users/14860/paxdiablo", 
"http://www.stackoverflow.com//users/1491895/barmar", "http://www.stackoverflow.com//users/15168/jonathan-leffler", 
"http://www.stackoverflow.com//users/157247/t-j-crowder", "http://www.stackoverflow.com//users/157882/balusc", 
"http://www.stackoverflow.com//users/17034/hans-passant", "http://www.stackoverflow.com//users/1863229/tim-biegeleisen", 
"http://www.stackoverflow.com//users/190597/unutbu", "http://www.stackoverflow.com//users/19068/quentin", 
"http://www.stackoverflow.com//users/209103/frank-van-puffelen", 
"http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer", 
"http://www.stackoverflow.com//users/218196/felix-kling", "http://www.stackoverflow.com//users/22656/jon-skeet", 
"http://www.stackoverflow.com//users/2336654/pirsquared", "http://www.stackoverflow.com//users/2901002/jezrael", 
"http://www.stackoverflow.com//users/29407/darin-dimitrov", "http://www.stackoverflow.com//users/3037257/pranav-c-balan", 
"http://www.stackoverflow.com//users/335858/dasblinkenlight", 
"http://www.stackoverflow.com//users/341994/matt", "http://www.stackoverflow.com//users/3681880/suragch", 
"http://www.stackoverflow.com//users/3732271/akrun", "http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew", 
"http://www.stackoverflow.com//users/4983450/psidom", "http://www.stackoverflow.com//users/571407/jb-nizet", 
"http://www.stackoverflow.com//users/6309/vonc", "http://www.stackoverflow.com//users/6348498/gurv", 
"http://www.stackoverflow.com//users/6419007/eric-duminil", "http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre", 
"http://www.stackoverflow.com//users/771848/alecxe", "http://www.stackoverflow.com//users/893/greg-hewgill"
), class = "factor"), location = structure(c(17L, 15L, 8L, 12L, 
10L, 26L, 1L, 28L, 23L, 1L, 17L, 25L, 6L, 29L, 26L, 19L, 24L, 
1L, 5L, 13L, 4L, 2L, 3L, 1L, 7L, 20L, 21L, 27L, 22L, 11L, 1L, 
16L, 9L, 1L, 18L, 14L), .Label = c("", "??????", "Amsterdam, Netherlands", 
"Arlington, MA", "Atlanta, GA, United States", "Bellevue, WA, United States", 
"Berlin, Deutschland", "Bratislava, Slovakia", "California, USA", 
"Cambridge, United Kingdom", "Christchurch, New Zealand", "France", 
"Germany", "Hohhot, China", "Linz, Austria", "Madison, WI", "New York, United States", 
"Ramanthali, Kannur, Kerala, India", "Reading, United Kingdom", 
"Saint-Etienne, France", "San Francisco, CA", "Singapore", "Sofia, Bulgaria", 
"Sunnyvale, CA", "Toulouse, France", "United Kingdom", "United States", 
"Warsaw, Poland", "Who Wants to Know?"), class = "factor"), year_rep = structure(c(36L, 
35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L, 
22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L, 
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("3,580", "3,604", 
"3,636", "3,649", "3,688", "3,735", "3,796", "3,814", "3,886", 
"3,920", "3,923", "3,950", "4,016", "4,046", "4,142", "4,179", 
"4,195", "4,236", "4,313", "4,324", "4,348", "4,464", "4,475", 
"4,482", "4,526", "4,723", "4,854", "4,936", "4,948", "5,188", 
"5,258", "5,337", "5,577", "5,740", "5,835", "5,985"), class = "factor"), 
    total_rep = structure(c(18L, 2L, 34L, 27L, 22L, 20L, 5L, 
    3L, 31L, 1L, 6L, 9L, 13L, 25L, 21L, 36L, 14L, 4L, 11L, 7L, 
    8L, 10L, 30L, 29L, 24L, 15L, 35L, 17L, 33L, 23L, 12L, 28L, 
    16L, 19L, 26L, 32L), .Label = c("12,557", "154,439", "158,134", 
    "220,515", "229,553", "233,368", "269,380", "289,989", "30,027", 
    "31,602", "36,950", "401,595", "41,183", "411,535", "418,780", 
    "455,157", "475,813", "499,408", "507,043", "508,310", "509,365", 
    "525,176", "529,137", "61,135", "616,135", "64,476", "651,397", 
    "672,118", "7,932", "703,046", "709,683", "71,032", "77,211", 
    "83,237", "86,520", "921,690"), class = "factor"), tag1 = structure(c(15L, 
    2L, 10L, 6L, 11L, 8L, 12L, 13L, 4L, 14L, 11L, 11L, 10L, 1L, 
    8L, 4L, 8L, 16L, 11L, 16L, 8L, 9L, 7L, 15L, 8L, 7L, 5L, 4L, 
    15L, 6L, 11L, 4L, 3L, 3L, 8L, 16L), .Label = c("android", 
    "angular2", "c", "c#", "firebase", "git", "java", "javascript", 
    "laravel", "pandas", "python", "r", "regex", "ruby", "sql", 
    "swift"), class = "factor"), tag2 = structure(c(23L, 24L, 
    19L, 8L, 20L, 14L, 6L, 13L, 3L, 21L, 22L, 20L, 19L, 12L, 
    10L, 12L, 14L, 11L, 17L, 11L, 18L, 18L, 15L, 16L, 2L, 9L, 
    7L, 12L, 16L, 19L, 17L, 1L, 4L, 5L, 14L, 11L), .Label = c(".net", 
    "arrays", "asp.net-mvc", "bash", "c++", "dplyr", "firebase-database", 
    "github", "hibernate", "html", "ios", "java", "javascript", 
    "jquery", "jsf", "mysql", "pandas", "php", "python", "python-3.x", 
    "ruby-on-rails", "selenium", "sql-server", "typescript"), class = "factor"), 
    tag3 = structure(c(20L, 17L, 11L, 12L, 24L, 15L, 11L, 8L, 
    5L, 4L, 23L, 24L, 11L, 3L, 10L, 1L, 6L, 31L, 25L, 28L, 18L, 
    19L, 26L, 27L, 22L, 16L, 2L, 9L, 15L, 13L, 21L, 30L, 29L, 
    7L, 14L, 2L), .Label = c(".net", "android", "android-intent", 
    "arrays", "asp.net-mvc-3", "asynchronous", "bash", "c#", 
    "c++", "css", "dataframe", "docker", "git-pull", "html", 
    "java", "java-8", "javascript", "jquery", "laravel-5.3", 
    "mysql", "numpy", "object", "protractor", "python-2.7", "r", 
    "servlets", "sql-server", "swift3", "unix", "winforms", "xcode"
    ), class = "factor")), .Names = c("user", "link", "location", 
"year_rep", "total_rep", "tag1", "tag2", "tag3"), class = "data.frame", row.names = c(NA, 
-36L))

R code

The methods below average the year_rep and total_rep (5th/6th) columns in either type, matrix or data frame. Be sure to change the return statement in the setup block, swapping in the commented line for the type you want. Notice that only the rapply() call for matrices returns the same result as the nested lapply; the one for data frames does not.

# NESTED LIST SETUP ------------------------------------
LangLists <- list(`c#`=list(), python=list(), sql=list(), php=list(), r=list(),
                  java=list(), javascript=list(), ruby=list(), `c++`=list())

LangLists <- setNames(mapply(function(i, j){

  df <- subset(df, tag1 == j | tag2 == j | tag3 == j)
  df$year_rep <- as.numeric(as.character(gsub(",", "", df$year_rep)))
  df$total_rep <- as.numeric(as.character(gsub(",", "", df$total_rep)))

  return(list(as.matrix(df)))   # MATRIX TYPE
  # return(list(df))            # DF TYPE

}, LangLists, names(LangLists), SIMPLIFY=FALSE), names(LangLists))
# -----------------------------------------------------

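# MATRIX RETURN
# Columns are selected by name: in the dput above (no leading X column),
# positions 5 and 6 would be total_rep and tag1 rather than year_rep and total_rep
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    cbind(mean(as.numeric(df[, "year_rep"])),
          mean(as.numeric(df[, "total_rep"])))
  })
})

LangLists2 <- rapply(LangLists, function(i){
  cbind(mean(as.numeric(i[, "year_rep"])),
        mean(as.numeric(i[, "total_rep"])))
}, classes="matrix", how="list")

all.equal(LangLists1, LangLists2)
# [1] TRUE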


# DATA FRAME RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){         
    data.frame(year_rep=mean(df$year_rep),
               total_rep=mean(df$total_rep))        
  })
})

LangLists2 <- rapply(LangLists, function(i){      
    data.frame(year_rep=mean(i$year_rep),
               total_rep=mean(i$total_rep))      
}, classes="data.frame", how="list")

all.equal(LangLists1, LangLists2)

# [1] "Component "c#": Component 1: Names: 2 string mismatches"                                               
# [2] "Component "c#": Component 1: Attributes: < names for target but not for current >"                     
# [3] "Component "c#": Component 1: Attributes: < Length mismatch: comparison on first 0 components >"        
# [4] "Component "c#": Component 1: Length mismatch: comparison on first 2 components"                        
# [5] "Component "c#": Component 1: Component 1: Modes: numeric, NULL"  
# ...

In fact, whereas the nested lapply returns a list of intact two-column data frames holding the rep means, the rapply for data frames converts the underlying data frames to lists of NULLs. So again, why does rapply fail to return a list of data frames, as it does for vectors and matrices?

# $`c#`
# $`c#`[[1]]
# $`c#`[[1]]$X
# NULL

# $`c#`[[1]]$user
# NULL

# $`c#`[[1]]$link
# NULL

# $`c#`[[1]]$location
# NULL

# $`c#`[[1]]$year_rep
# NULL

# $`c#`[[1]]$total_rep
# NULL

# $`c#`[[1]]$tag1
# NULL

# $`c#`[[1]]$tag2
# NULL

# $`c#`[[1]]$tag3
# NULL

# $python
# $python[[1]]
# $python[[1]]$X
# NULL

# $python[[1]]$user
# NULL

# $python[[1]]$link
# NULL

# $python[[1]]$location
# NULL

# $python[[1]]$year_rep
# NULL

# $python[[1]]$total_rep
# NULL

# $python[[1]]$tag1
# NULL

# $python[[1]]$tag2
# NULL

# $python[[1]]$tag3
# NULL
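
One possible reading of the NULL output above, offered as a sketch rather than a confirmed answer: a data frame is itself a list of columns, so rapply appears to descend into it and treat each column as a leaf, while a matrix is an atomic vector and is passed to the function whole. The toy objects `dfs` and `mats` below are hypothetical and not part of the original data.

```r
# Toy data (hypothetical): one data frame and one matrix, each nested one level deep
dfs  <- list(a = list(data.frame(x = 1:3, y = c(2, 4, 6))))
mats <- list(a = list(matrix(1:6, nrow = 2)))

# A data frame is a list of columns; a matrix is an atomic vector with dims
is.list(dfs$a[[1]])    # TRUE
is.list(mats$a[[1]])   # FALSE

# Matrix: treated as a single leaf, so f receives the whole matrix
res_mat <- rapply(mats, colMeans, classes = "matrix", how = "list")

# Data frame: rapply recurses into the columns; no column has class
# "data.frame", so each one is replaced by deflt (NULL) under how = "list"
res_df <- rapply(dfs, colMeans, classes = "data.frame", how = "list")

# Matching on the column classes instead shows f being called once per column
res_col <- rapply(dfs, mean, classes = c("integer", "numeric"), how = "list")

# Nested lapply stops at the data frame and hands it to f intact
res_lapply <- lapply(dfs, function(grp) lapply(grp, colMeans))
```

Under this reading, rapply can still be used on the columns of each data frame, but an operation that needs the whole data frame at once would stay with nested lapply, as in the question's own code.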

Recommended answer