Problem description
I am looking for ways to speed up my code. I am looking into the apply/ply methods as well as data.table. Unfortunately, I am running into problems.
Here is a small sample of the data:
ids1 <- c(1, 1, 1, 1, 2, 2, 2, 2)
ids2 <- c(1, 2, 3, 4, 1, 2, 3, 4)
chars1 <- c("aa", " bb ", "__cc__", "dd ", "__ee", NA,NA, "n/a")
chars2 <- c("vv", "_ ww_", " xx ", "yy__", " zz", NA, "n/a", "n/a")
data <- data.frame(col1 = ids1, col2 = ids2,
col3 = chars1, col4 = chars2,
stringsAsFactors = FALSE)
Here is a solution using loops:
library("plyr")
cols_to_fix <- c("col3","col4")
for (i in 1:length(cols_to_fix)) {
data[,cols_to_fix[i]] <- gsub("_", "", data[,cols_to_fix[i]])
data[,cols_to_fix[i]] <- gsub(" ", "", data[,cols_to_fix[i]])
data[,cols_to_fix[i]] <- ifelse(data[,cols_to_fix[i]]=="n/a", NA, data[,cols_to_fix[i]])
}
I initially looked at ddply, but some of the methods I want to use only take vectors. Hence, I cannot figure out how to apply ddply to just certain columns, one by one.
Also, I have been looking at laply, but I want to get back the original data.frame with the changes applied. Can anyone help me? Thank you.
Based on the earlier suggestions, here is what I tried from the plyr package.
Option 1:
data[,cols_to_fix] <- aaply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text",.drop = FALSE)
Option 2:
data[,cols_to_fix] <- alply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text")
Option 3:
data[,cols_to_fix] <- adply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text")
None of these gives me the correct answer.
apply works great, but my data is very large and the progress bars from the plyr package would be very nice to have. Thanks again.
Recommended answer
Here's a data.table solution using set.
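require(data.table)
DT <- data.table(data)
for (j in cols_to_fix) {
  # Strip spaces and underscores from column j, updating DT by reference
  set(DT, i=NULL, j=j, value=gsub("[ _]", "", DT[[j]], perl=TRUE))
  # Replace the literal string "n/a" with a real missing value
  set(DT, i=which(DT[[j]] == "n/a"), j=j, value=NA_character_)
}
DT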
# col1 col2 col3 col4
# 1: 1 1 aa vv
# 2: 1 2 bb ww
# 3: 1 3 cc xx
# 4: 1 4 dd yy
# 5: 2 1 ee zz
# 6: 2 2 NA NA
# 7: 2 3 NA NA
# 8: 2 4 NA NA
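Since the question also asks for a progress bar, here is a minimal sketch of the same column-by-column cleaning with base R's txtProgressBar. It assumes a plain data.frame rather than a data.table, and clean_column() is a hypothetical helper name introduced only for this illustration. The bar advances once per column, so it is only informative when there are many columns to fix.

```r
# Sample data from the question
data <- data.frame(
  col1 = c(1, 1, 1, 1, 2, 2, 2, 2),
  col2 = c(1, 2, 3, 4, 1, 2, 3, 4),
  col3 = c("aa", " bb ", "__cc__", "dd ", "__ee", NA, NA, "n/a"),
  col4 = c("vv", "_ ww_", " xx ", "yy__", " zz", NA, "n/a", "n/a"),
  stringsAsFactors = FALSE
)
cols_to_fix <- c("col3", "col4")

# clean_column() is a hypothetical helper, not part of any package
clean_column <- function(x) {
  x <- gsub("[ _]", "", x, perl = TRUE)   # strip spaces and underscores
  x[which(x == "n/a")] <- NA_character_   # which() skips NA comparisons
  x
}

# One progress-bar tick per cleaned column
pb <- txtProgressBar(min = 0, max = length(cols_to_fix), style = 3)
for (k in seq_along(cols_to_fix)) {
  j <- cols_to_fix[k]
  data[[j]] <- clean_column(data[[j]])
  setTxtProgressBar(pb, k)
}
close(pb)
```

The same setTxtProgressBar() call could presumably be placed inside the set() loop above; only the loop index would need to be tracked alongside the column name.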
Note: Using PCRE (perl=TRUE) gives a nice speed-up, especially on bigger vectors.
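One way to check the effect of perl=TRUE on your own machine is a quick timing sketch like the one below. The test vector, its size of 1e5 and the seed are arbitrary assumptions made for illustration, and the actual timings will vary with the data and the R build, so no result is claimed here.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# A larger character vector built by resampling the messy sample values
x <- sample(c("aa", " bb ", "__cc__", "dd ", "__ee", "n/a"),
            1e5, replace = TRUE)

# Time the default regex engine against PCRE on the same pattern
t_default <- system.time(res_default <- gsub("[ _]", "", x))
t_pcre    <- system.time(res_pcre    <- gsub("[ _]", "", x, perl = TRUE))

# Both engines should produce the same cleaned strings
stopifnot(identical(res_default, res_pcre))

# Elapsed seconds for each variant
print(c(default = t_default[["elapsed"]], pcre = t_pcre[["elapsed"]]))
```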