Problem description
I followed Hadley's thread, "Issue in Loading multiple .csv files into single dataframe in R using rbind", to read multiple CSV files and then combine them into one data frame. I also experimented with `lapply` vs. `sapply`, as discussed in "Grouping functions (tapply, by, aggregate) and the *apply family".
Here's my first CSV file:
dput(File1)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A",
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L,
23L, 34L, 45L, 44L), Tax = c(23L, 21L, 22L, 24L, 25L), Location = structure(c(3L,
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name",
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA,
-5L))
Here's my second CSV file:
dput(File2)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A",
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L,
55L, 55L, 55L, 55L), Tax = c(24L, 24L, 24L, 24L, 24L), Location = structure(c(3L,
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name",
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA,
-5L))
Here is my code:
dat1 <-",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <-",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"
tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)
merged_file <- do.call(rbind, lapply(list(tc1,tc2), read.csv))
While this works beautifully, I wanted to change `lapply` to `sapply`. From the above thread, I realize that `sapply` would change the factors read from the CSV files into matrices, but I am unsure why the fields are flipped. For instance, the `Income` field occupies rows 3 and 8 instead of sitting in a single column.
Here is the code:
tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)
# change lapply to sapply
merged_file <- do.call(rbind, sapply(list(tc1,tc2), read.csv))
Here is the output:
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    2    1    1    1
 [2,]    1    2    2    2    2
 [3,]   55   23   34   45   44
 [4,]   23   21   22   24   25
 [5,]    3    3    1    4    2
 [6,]    1    2    1    1    1
 [7,]    1    2    2    2    2
 [8,]   55   55   55   55   55
 [9,]   24   24   24   24   24
[10,]    3    3    1    4    2
I'd appreciate any help. I am fairly new to R and not sure what's going on.
Recommended answer
The issue has nothing to do with factors; it is the generic `sapply` vs. `lapply` difference. Why does `sapply` get it so wrong whereas `lapply` gets it right? Remember that in R, data frames are lists of columns, and each column can have a distinct type.
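The list-of-columns structure is easy to check on a small data frame. The `df` below and its values are made up for illustration and are not taken from the question's files:

```r
# Hypothetical two-column data frame, for illustration only
df <- data.frame(Name = factor(c("A", "C")), Income = c(55L, 23L))

is.list(df)                       # TRUE: a data frame is a list ...
length(df)                        # 2: ... with one element per column
vapply(df, class, character(1))   # each column keeps its own type
```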
`lapply` returns a list of data frames to `rbind`, which does the concatenation correctly. It keeps corresponding columns together, so your factors emerge correctly.
`sapply`, however, tries to simplify its result:
- it returns a matrix instead of a list of data frames (and the final matrix can hold only one type, unlike a data frame)
- ...which, worse still, comes with an unwanted transpose: each column of a data frame becomes a separate element that `rbind` later treats as a row
- so `sapply` effectively turns your two 5x6 input data frames into transposed 6x5 matrices (columns now correspond to rows)...
- with all data coerced to numeric, the factors reduced to their integer codes (garbage!)
- then `rbind` row-"concatenates" those two garbage 6x5 blocks into one very-garbage 12x5 numeric matrix. Since columns have been transposed into rows, each column of the result mixes values from different fields, and obviously your factors are messed up.
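The difference can be inspected directly. The sketch below reuses `dat1` and `dat2` from the question, with two substitutions of my own: it reads them through `read.csv(text = ...)` rather than text connections so each string can be read more than once, and it sets `stringsAsFactors = TRUE` explicitly because R 4.0.0 changed that default to `FALSE`. The helper name `read_one` is hypothetical.

```r
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

# Hypothetical helper: parse one CSV string, keeping strings as factors
read_one <- function(txt) read.csv(text = txt, stringsAsFactors = TRUE)

# lapply: a plain list of data frames, which rbind stacks column by column
as_list <- lapply(list(dat1, dat2), read_one)
good <- do.call(rbind, as_list)
dim(good)            # 10 rows, 6 columns (row-number column X plus five fields)
class(good$Location) # still a factor

# sapply: simplified to a matrix whose cells are the individual columns
as_matrix <- sapply(list(dat1, dat2), read_one)
dim(as_matrix)       # 6 rows (one per column), 2 columns (one per file)

# rbind now receives 12 separate vectors and stacks each one as a row
bad <- do.call(rbind, as_matrix)
dim(bad)             # 12 rows, 5 columns
bad[4, ]             # the Income values of the first file, laid out as a row
```

Here the row-number column `X` is included, so `Income` lands in rows 4 and 10 instead of rows 3 and 8 as in the question's output.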
Summary: just use `lapply`.
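For CSV files on disk, the same pattern might look like the sketch below. The temporary directory, the file names `file1.csv` and `file2.csv`, and their contents are placeholders invented for the example.

```r
# Write two small example files into a temporary directory (placeholder data)
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("First.Name,Income", "A,55", "C,23"), file.path(csv_dir, "file1.csv"))
writeLines(c("First.Name,Income", "A,55", "C,55"), file.path(csv_dir, "file2.csv"))

# Read every CSV into a list of data frames, then stack them row-wise
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
frames <- lapply(csv_files, read.csv)
merged_file <- do.call(rbind, frames)

str(merged_file)
```

Because `lapply` always returns a list, `do.call(rbind, ...)` receives whole data frames and matches their columns by name.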