Running rapply on a list of data frames

Problem description

To follow up on two rapply questions from years ago, it seems that rapply only works on simple classes (i.e., vector, matrix) and not on the multifaceted data.frame class.

In most cases, as demonstrated below, the rapply equivalent is a nested lapply (or its variant wrappers, vapply/sapply), where the number of nested calls corresponds to the number of list levels. Below is my test scenario comparing nested lapply and rapply across vector, matrix, and dataframe types. The two approaches give equal results for every type except dataframes.

Question

Is there a use case in base R for rapply() to recursively run operations on a list of dataframes and return a list of dataframes, as it does for lists of vectors or matrices? If not, is this a bug, or should the ?rapply base R docs warn about it? Most tutorials do not show rapply examples with dataframes.

One dimension (character vector)

The code below shows that rapply is equivalent to nested lapply when counting characters in simple character vectors, and that rapply is appreciably faster:

library(microbenchmark)

ScriptLists <- list(R = list.files(path="/path/to/Scripts", pattern="\\.R"),
                    Python = list.files(path="/path/to/Scripts", pattern="\\.py"),
                    SQL = list.files(path="/path/to/Scripts", pattern="\\.sql"),
                    PHP = list.files(path="/path/to/Scripts", pattern="\\.php"),
                    XSLT = list.files(path="/path/to/Scripts", pattern="\\.xsl"))

microbenchmark(
  ScriptsLists1 <- lapply(ScriptLists, function(i){
    unname(vapply(i, function(x){ 
      nchar(x)
      }, numeric(1)))
    })
)
# Unit: microseconds
# min      lq     mean   median      uq     max neval
# 384 408.782 524.1363 434.7675 678.016 886.377   100

microbenchmark(
  ScriptsLists2 <- rapply(ScriptLists, function(x){
    nchar(x)
  }, how="list")
)
# Unit: microseconds
# min           lq     mean   median     uq     max neval
# 110.196 112.8425 131.6141 114.5265 123.91 352.722   100

all.equal(ScriptsLists1, ScriptsLists2)
# [1] TRUE
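
Because the benchmark above depends on a local /path/to/Scripts directory, the same equivalence can be sketched with an in-memory list. The file names below are made up for illustration and do not come from the original post; timings are left out on purpose.

```r
# Hypothetical file names standing in for the list.files() results
script_lists <- list(
  R      = c("clean.R", "model.R"),
  Python = c("etl.py"),
  SQL    = c("report.sql")
)

# Nested approach: one lapply for the list level, vapply over each leaf vector
nested <- lapply(script_lists, function(files) {
  unname(vapply(files, nchar, integer(1)))
})

# rapply calls nchar on every non-list leaf and keeps the list shape
recursive <- rapply(script_lists, nchar, how = "list")

all.equal(nested, recursive)
```

Each extra level of nesting would need one more lapply in the first version, while the rapply call stays the same.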

Two-dimensional types (matrix and dataframe)

The input dataframe (pulled from the highest yearly rankings of StackOverflow top users) is used to build a list of top users' dataframes by language tag (C#, Python, R, etc.).

df <- structure(list(user = structure(c(12L, 14L, 19L, 35L, 22L, 32L, 
1L, 36L, 7L, 9L, 2L, 18L, 27L, 6L, 30L, 20L, 10L, 24L, 29L, 23L, 
5L, 3L, 4L, 15L, 25L, 17L, 11L, 8L, 33L, 13L, 34L, 16L, 21L, 
26L, 28L, 31L), .Label = c("akrun", "alecxe", "Alexey Mezenin", 
"BalusC", "Barmar", "CommonsWare", "Darin Dimitrov", "dasblinkenlight", 
"Eric Duminil", "Felix Kling", "Frank van Puffelen", "Gordon Linoff", 
"Greg Hewgill", "Günter Zöchbauer", "GurV", "Hans Passant", "JB Nizet", 
"Jean-François Fabre", "jezrael", "Jon Skeet", "Jonathan Leffler", 
"Martijn Pieters", "Martin R", "matt", "Nina Scholz", "paxdiablo", 
"piRSquared", "Pranav C Balan", "Psidom", "Quentin", "Suragch", 
"T.J. Crowder", "Tim Biegeleisen", "unutbu", "VonC", "Wiktor Stribiżew"
), class = "factor"), link = structure(c(2L, 17L, 21L, 31L, 1L, 
10L, 27L, 28L, 22L, 33L, 35L, 34L, 20L, 3L, 15L, 19L, 18L, 25L, 
29L, 4L, 8L, 5L, 11L, 32L, 6L, 30L, 16L, 24L, 13L, 36L, 14L, 
12L, 9L, 7L, 23L, 26L), .Label = c("http://www.stackoverflow.com//users/100297/martijn-pieters", 
"http://www.stackoverflow.com//users/1144035/gordon-linoff", 
"http://www.stackoverflow.com//users/115145/commonsware", "http://www.stackoverflow.com//users/1187415/martin-r", 
"http://www.stackoverflow.com//users/1227923/alexey-mezenin", 
"http://www.stackoverflow.com//users/1447675/nina-scholz", "http://www.stackoverflow.com//users/14860/paxdiablo", 
"http://www.stackoverflow.com//users/1491895/barmar", "http://www.stackoverflow.com//users/15168/jonathan-leffler", 
"http://www.stackoverflow.com//users/157247/t-j-crowder", "http://www.stackoverflow.com//users/157882/balusc", 
"http://www.stackoverflow.com//users/17034/hans-passant", "http://www.stackoverflow.com//users/1863229/tim-biegeleisen", 
"http://www.stackoverflow.com//users/190597/unutbu", "http://www.stackoverflow.com//users/19068/quentin", 
"http://www.stackoverflow.com//users/209103/frank-van-puffelen", 
"http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer", 
"http://www.stackoverflow.com//users/218196/felix-kling", "http://www.stackoverflow.com//users/22656/jon-skeet", 
"http://www.stackoverflow.com//users/2336654/pirsquared", "http://www.stackoverflow.com//users/2901002/jezrael", 
"http://www.stackoverflow.com//users/29407/darin-dimitrov", "http://www.stackoverflow.com//users/3037257/pranav-c-balan", 
"http://www.stackoverflow.com//users/335858/dasblinkenlight", 
"http://www.stackoverflow.com//users/341994/matt", "http://www.stackoverflow.com//users/3681880/suragch", 
"http://www.stackoverflow.com//users/3732271/akrun", "http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew", 
"http://www.stackoverflow.com//users/4983450/psidom", "http://www.stackoverflow.com//users/571407/jb-nizet", 
"http://www.stackoverflow.com//users/6309/vonc", "http://www.stackoverflow.com//users/6348498/gurv", 
"http://www.stackoverflow.com//users/6419007/eric-duminil", "http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre", 
"http://www.stackoverflow.com//users/771848/alecxe", "http://www.stackoverflow.com//users/893/greg-hewgill"
), class = "factor"), location = structure(c(17L, 15L, 8L, 12L, 
10L, 26L, 1L, 28L, 23L, 1L, 17L, 25L, 6L, 29L, 26L, 19L, 24L, 
1L, 5L, 13L, 4L, 2L, 3L, 1L, 7L, 20L, 21L, 27L, 22L, 11L, 1L, 
16L, 9L, 1L, 18L, 14L), .Label = c("", "??????", "Amsterdam, Netherlands", 
"Arlington, MA", "Atlanta, GA, United States", "Bellevue, WA, United States", 
"Berlin, Deutschland", "Bratislava, Slovakia", "California, USA", 
"Cambridge, United Kingdom", "Christchurch, New Zealand", "France", 
"Germany", "Hohhot, China", "Linz, Austria", "Madison, WI", "New York, United States", 
"Ramanthali, Kannur, Kerala, India", "Reading, United Kingdom", 
"Saint-Etienne, France", "San Francisco, CA", "Singapore", "Sofia, Bulgaria", 
"Sunnyvale, CA", "Toulouse, France", "United Kingdom", "United States", 
"Warsaw, Poland", "Who Wants to Know?"), class = "factor"), year_rep = structure(c(36L, 
35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L, 
22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L, 
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("3,580", "3,604", 
"3,636", "3,649", "3,688", "3,735", "3,796", "3,814", "3,886", 
"3,920", "3,923", "3,950", "4,016", "4,046", "4,142", "4,179", 
"4,195", "4,236", "4,313", "4,324", "4,348", "4,464", "4,475", 
"4,482", "4,526", "4,723", "4,854", "4,936", "4,948", "5,188", 
"5,258", "5,337", "5,577", "5,740", "5,835", "5,985"), class = "factor"), 
    total_rep = structure(c(18L, 2L, 34L, 27L, 22L, 20L, 5L, 
    3L, 31L, 1L, 6L, 9L, 13L, 25L, 21L, 36L, 14L, 4L, 11L, 7L, 
    8L, 10L, 30L, 29L, 24L, 15L, 35L, 17L, 33L, 23L, 12L, 28L, 
    16L, 19L, 26L, 32L), .Label = c("12,557", "154,439", "158,134", 
    "220,515", "229,553", "233,368", "269,380", "289,989", "30,027", 
    "31,602", "36,950", "401,595", "41,183", "411,535", "418,780", 
    "455,157", "475,813", "499,408", "507,043", "508,310", "509,365", 
    "525,176", "529,137", "61,135", "616,135", "64,476", "651,397", 
    "672,118", "7,932", "703,046", "709,683", "71,032", "77,211", 
    "83,237", "86,520", "921,690"), class = "factor"), tag1 = structure(c(15L, 
    2L, 10L, 6L, 11L, 8L, 12L, 13L, 4L, 14L, 11L, 11L, 10L, 1L, 
    8L, 4L, 8L, 16L, 11L, 16L, 8L, 9L, 7L, 15L, 8L, 7L, 5L, 4L, 
    15L, 6L, 11L, 4L, 3L, 3L, 8L, 16L), .Label = c("android", 
    "angular2", "c", "c#", "firebase", "git", "java", "javascript", 
    "laravel", "pandas", "python", "r", "regex", "ruby", "sql", 
    "swift"), class = "factor"), tag2 = structure(c(23L, 24L, 
    19L, 8L, 20L, 14L, 6L, 13L, 3L, 21L, 22L, 20L, 19L, 12L, 
    10L, 12L, 14L, 11L, 17L, 11L, 18L, 18L, 15L, 16L, 2L, 9L, 
    7L, 12L, 16L, 19L, 17L, 1L, 4L, 5L, 14L, 11L), .Label = c(".net", 
    "arrays", "asp.net-mvc", "bash", "c++", "dplyr", "firebase-database", 
    "github", "hibernate", "html", "ios", "java", "javascript", 
    "jquery", "jsf", "mysql", "pandas", "php", "python", "python-3.x", 
    "ruby-on-rails", "selenium", "sql-server", "typescript"), class = "factor"), 
    tag3 = structure(c(20L, 17L, 11L, 12L, 24L, 15L, 11L, 8L, 
    5L, 4L, 23L, 24L, 11L, 3L, 10L, 1L, 6L, 31L, 25L, 28L, 18L, 
    19L, 26L, 27L, 22L, 16L, 2L, 9L, 15L, 13L, 21L, 30L, 29L, 
    7L, 14L, 2L), .Label = c(".net", "android", "android-intent", 
    "arrays", "asp.net-mvc-3", "asynchronous", "bash", "c#", 
    "c++", "css", "dataframe", "docker", "git-pull", "html", 
    "java", "java-8", "javascript", "jquery", "laravel-5.3", 
    "mysql", "numpy", "object", "protractor", "python-2.7", "r", 
    "servlets", "sql-server", "swift3", "unix", "winforms", "xcode"
    ), class = "factor")), .Names = c("user", "link", "location", 
"year_rep", "total_rep", "tag1", "tag2", "tag3"), class = "data.frame", row.names = c(NA, 
-36L))

R code

The methods below average the year_rep and total_rep columns for either type, matrix or dataframe. To choose the type, switch the return statement in the setup block by commenting out one line and uncommenting the other. Notice that only the rapply() call for matrices returns the same result as the nested lapply; the call for dataframes does not.

# NESTED LIST SETUP ------------------------------------
LangLists <- list(`c#`=list(), python=list(), sql=list(), php=list(), r=list(),
                  java=list(), javascript=list(), ruby=list(), `c++`=list())

LangLists <- setNames(mapply(function(i, j){

  df <- subset(df, tag1 == j | tag2 == j | tag3 == j)
  df$year_rep <- as.numeric(as.character(gsub(",", "", df$year_rep)))
  df$total_rep <- as.numeric(as.character(gsub(",", "", df$total_rep)))

  return(list(as.matrix(df)))   # MATRIX TYPE
  # return(list(df))            # DF TYPE

}, LangLists, names(LangLists), SIMPLIFY=FALSE), names(LangLists))
# -----------------------------------------------------

# MATRIX RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    # select columns by name so the result does not depend on column order
    cbind(mean(as.numeric(df[, "year_rep"])),
          mean(as.numeric(df[, "total_rep"])))
  })
})

LangLists2 <- rapply(LangLists, function(i){
  cbind(mean(as.numeric(i[, "year_rep"])),
        mean(as.numeric(i[, "total_rep"])))
}, classes="matrix", how="list")

all.equal(LangLists1, LangLists2)
# [1] TRUE


# DATA FRAME RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){         
    data.frame(year_rep=mean(df$year_rep),
               total_rep=mean(df$total_rep))        
  })
})

LangLists2 <- rapply(LangLists, function(i){      
    data.frame(year_rep=mean(i$year_rep),
               total_rep=mean(i$total_rep))      
}, classes="data.frame", how="list")

all.equal(LangLists1, LangLists2)

# [1] "Component "c#": Component 1: Names: 2 string mismatches"                                               
# [2] "Component "c#": Component 1: Attributes: < names for target but not for current >"                     
# [3] "Component "c#": Component 1: Attributes: < Length mismatch: comparison on first 0 components >"        
# [4] "Component "c#": Component 1: Length mismatch: comparison on first 2 components"                        
# [5] "Component "c#": Component 1: Component 1: Modes: numeric, NULL"  
# ...

In fact, whereas the nested lapply returns a list of intact two-column dataframes holding the rep means, the rapply for dataframes turns the underlying dataframes into lists of NULLs. So again, why does rapply fail to return a list of dataframes when it works for vectors and matrices?

# $`c#`
# $`c#`[[1]]
# $`c#`[[1]]$X
# NULL

# $`c#`[[1]]$user
# NULL

# $`c#`[[1]]$link
# NULL

# $`c#`[[1]]$location
# NULL

# $`c#`[[1]]$year_rep
# NULL

# $`c#`[[1]]$total_rep
# NULL

# $`c#`[[1]]$tag1
# NULL

# $`c#`[[1]]$tag2
# NULL

# $`c#`[[1]]$tag3
# NULL

# $python
# $python[[1]]
# $python[[1]]$X
# NULL

# $python[[1]]$user
# NULL

# $python[[1]]$link
# NULL

# $python[[1]]$location
# NULL

# $python[[1]]$year_rep
# NULL

# $python[[1]]$total_rep
# NULL

# $python[[1]]$tag1
# NULL

# $python[[1]]$tag2
# NULL

# $python[[1]]$tag3
# NULL

Recommended answer

It appears that rapply is not designed to process lists of data.frames.

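The Details section of ?rapply explains that with how = "list" or how = "unlist" the list is copied, every non-list element whose class is included in classes is replaced by the result of applying f to it, and all other elements are replaced by deflt.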

Since data.frames are lists, they do not fall under the first category (non-list elements). rapply therefore descends into their columns, and because no column has class "data.frame", every column falls into the "all others" catch-all and is replaced by deflt, whose default value is NULL. This explains the result of the final line of code in the question.
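
The recursion into columns can be seen in a small sketch. The toy data frame and the deflt = NA choice are assumptions made for illustration, so that the unmatched leaves are easier to spot than with the default NULL.

```r
# Toy data frame; column names and values are illustrative only
toy <- data.frame(year_rep = c(10, 20), total_rep = c(100, 300))
nested <- list(lang = list(toy))

is.list(toy)  # a data frame is stored as a list of columns

# With the default classes = "ANY", f is called on each column, not on toy
col_classes <- rapply(nested, class, how = "list")

# No column has class "data.frame", so every leaf falls back to deflt
unmatched <- rapply(nested, function(x) x, classes = "data.frame",
                    deflt = NA, how = "list")

str(col_classes)
str(unmatched)
```

In both results the data frame has become a plain list with one entry per column, which matches the shape of the NULL-filled output in the question.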

The remaining option for how is "replace", and data.frames appear to be simply ignored in this mode: according to the documentation, each element that is not itself a list and has a class included in classes is replaced by the result of applying f to it.

The documentation does not mention elements that are themselves lists, and running the code above with how="replace" appears to return a nested list in which what were data.frames are now simple lists. So it appears that rapply went through them and stripped the class attribute.
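
If a recursive pass that keeps data frames whole is needed, one possible workaround is a small hand-written recursion. This is a sketch under stated assumptions, not something proposed in the answer: rapply_df is a hypothetical helper name, not a base R function, and the toy data is made up.

```r
# Hypothetical helper (not part of base R): recurse through plain lists,
# but treat each data frame as a leaf and apply f to it as a whole
rapply_df <- function(x, f, ...) {
  if (is.data.frame(x)) {
    f(x, ...)
  } else if (is.list(x)) {
    lapply(x, function(el) rapply_df(el, f, ...))
  } else {
    x  # leave any other kind of leaf untouched
  }
}

# Toy input with the same nesting as LangLists (values are illustrative)
toy <- data.frame(year_rep = c(10, 20), total_rep = c(100, 300))
nested <- list(r = list(toy), python = list(toy))

means <- rapply_df(nested, function(d) {
  data.frame(year_rep = mean(d$year_rep), total_rep = mean(d$total_rep))
})

means$r[[1]]
```

The is.data.frame check has to come before is.list, because a data frame satisfies both; that ordering is what keeps the data frame from being split into columns.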

