问题描述
我有一个如下表:
myDT <- fread(
"id,other,strformat,content
1, other1, A:B, a1:b1
2, other2, A:C, a2:c2
3, other3, B:A:C, b3:a3:c3
4, other4, A:B, a4:b4
5, other5, XX:whatever, xx5:whatever5
")
我想基于 strformat
拆分 content
列,以获取此信息:
And I want to split the content
column based on strformat
, to get this:
id other strformat content A B C XX whatever
1: 1 other1 A:B a1:b1 a1 b1 <NA> <NA> <NA>
2: 2 other2 A:C a2:c2 a2 <NA> c2 <NA> <NA>
3: 3 other3 B:A:C b3:a3:c3 a3 b3 c3 <NA> <NA>
4: 4 other4 A:B a4:b4 a4 b4 <NA> <NA> <NA>
5: 5 other5 XX:whatever xx5:whatever5 <NA> <NA> <NA> xx5 whatever5
我在 by =
上的 tstrsplit()
失败了:
myDT[, unlist(strsplit(strformat,':')):=tstrsplit(content,':'), by=strformat]
# Error in strsplit(strformat, ":") : object 'strformat' not found
所以现在我诉诸于使用循环:
So for now I resorted to using a cycle:
for (this.format in unique(myDT$strformat)){
myDT[strformat==this.format, unlist(strsplit(this.format,':')):=tstrsplit(content,':')]
}
它可以完成工作,但是我仍然想知道 by =
It does the job, but I'm still wondering what would be the right way with by=
推荐答案
因此,我测试了@akrun提出的3种解决方案,并进行了一些修改.跳过最后一个,因为它具有硬编码的列名.
So, I have tested 3 solutions kindly suggested by @akrun, with slight modifications. Skipped the last one because it has the column names hardcoded.
# define functions to compare:
require(splitstackshape)
f_csplit <- function(inpDT, col_format='strformat', col_content='content', sep=':'){
invisible(inpDT[dcast(
cSplit(inpDT, c(col_format, col_content), sep, "long"),
as.formula(paste('id',col_format,sep='~')),
value.var=col_content
), , on = .(id)])
}
f_lapply_str <- function(inpDT, col_format='strformat', col_content='content', sep=':'){
invisible(inpDT[dcast(
inpDT[, unlist(lapply(.SD, strsplit, sep), recursive = FALSE), by = id, .SDcols = 2:3],
as.formula(paste('id',col_format,sep='~')),
value.var=col_content
), on = .(id)])
}
require(tidyverse)
f_unnest <- function(inpDT, col_format='strformat', col_content='content', sep=':'){
invisible(inpDT[dcast(
unnest(inpDT[, lapply(.SD, tstrsplit, sep),by = id, .SDcols = 2:3]),
as.formula(paste('id',col_format,sep='~')),
value.var=col_content
), on = .(id)])
}
f_cycle <- function(inpDT, col_format='strformat', col_content='content', sep=':'){
inpDT <- copy(inpDT); # in fact I don't even need to make a copy:
# := modifies the original table which is fine for me -
# but for benchmarking let's make a copy
for (this.format in unique(inpDT[[col_format]])){
inpDT[get(col_format)==this.format, unlist(strsplit(this.format,sep)):=tstrsplit(get(col_content),sep)]
}
invisible(inpDT)
}
似乎解决方案#2( strsplit
的 lapply
,没有 cSplit
)和#3( unnest)
当我在表中有任何其他列时无法正常工作,仅当我删除其他"后才能正常工作:
It seems that solutions #2 (lapply
of strsplit
, without cSplit
) and #3 (unnest)
don't work correctly when I have any other columns in the table, it only works if I remove "other":
myDT[dcast(myDT[, unlist(lapply(.SD, strsplit, ":"), recursive = FALSE), by = id, .SDcols = 2:3], id ~ strformat), on = .(id)]
# id other strformat content A B C XX whatever
# 1: 1 other1 A:B a1:b1 A B <NA> <NA> <NA>
# 2: 2 other2 A:C a2:c2 A <NA> C <NA> <NA>
# 3: 3 other3 B:A:C b3:a3:c3 A B C <NA> <NA>
# 4: 4 other4 A:B a4:b4 A B <NA> <NA> <NA>
# 5: 5 other5 XX:whatever xx5:whatever5 <NA> <NA> <NA> XX whatever
myDT[dcast(unnest(myDT[, lapply(.SD, tstrsplit, ":"),by = id, .SDcols = 2:3]), id ~ strformat), on = .(id)]
# (same result as above)
myDT$other <- NULL
myDT[dcast(myDT[, unlist(lapply(.SD, strsplit, ":"), recursive = FALSE), by = id, .SDcols = 2:3], id ~ strformat), on = .(id)]
# id strformat content A B C XX whatever
# 1: 1 A:B a1:b1 a1 b1 <NA> <NA> <NA>
# 2: 2 A:C a2:c2 a2 <NA> c2 <NA> <NA>
# 3: 3 B:A:C b3:a3:c3 a3 b3 c3 <NA> <NA>
# 4: 4 A:B a4:b4 a4 b4 <NA> <NA> <NA>
# 5: 5 XX:whatever xx5:whatever5 <NA> <NA> <NA> xx5 whatever5
myDT[dcast(unnest(myDT[, lapply(.SD, tstrsplit, ":"),by = id, .SDcols = 2:3]), id ~ strformat), on = .(id)]
# (same correct result as above)
以下是基准测试,其中删除了其他"列:
Here is the benchmarking with "other" columns removed:
# make a bigger table based on a small one:
myDTbig <- myDT[sample(nrow(myDT),1e5, replace = T),]
myDTbig[, id:=seq_len(nrow(myDTbig))]
myDTbig$other <- NULL
require(microbenchmark)
print(microbenchmark(
f_csplit(myDTbig),
f_lapply_str(myDTbig),
f_unnest(myDTbig),
f_cycle(myDTbig),
times=10L
), signif=2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_csplit(myDTbig) 420 430 470 440 450 670 10
# f_lapply_str(myDTbig) 4200 4300 4700 4700 5100 5400 10
# f_unnest(myDTbig) 3900 4400 4500 4500 4800 5100 10
# f_cycle(myDTbig) 88 96 98 98 100 100 10
并保留其他"列:
# make a bigger table based on a small one:
myDTbig <- myDT[sample(nrow(myDT),1e5, replace = T),]
myDTbig[, id:=seq_len(nrow(myDTbig))]
require(microbenchmark)
print(microbenchmark(
f_csplit(myDTbig),
f_cycle(myDTbig),
times=100L
), signif=2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_csplit(myDTbig) 410 440 500 460 490 1300 100
# f_cycle(myDTbig) 84 93 110 96 100 270 100
以下是我的真实数据集.好吧,实际上,只有它的1/10:在完整的代码中,我在 csplit
解决方案上遇到了内存分配错误(而有周期的代码工作得很好).
And below is with my real dataset. Well, actually, only 1/10th of it: with the full one I had memory allocation error on csplit
solution (while the one with the cycle worked fine).
myDTbig <- dt.vcf[1:2e6,]
myDTbig[,id:=seq_len(nrow(myDTbig))]
print(microbenchmark(
f_csplit(myDTbig, 'FORMAT', 'S_1'),
f_cycle(myDTbig, 'FORMAT', 'S_1'),
times=5L
), signif=2)
# Unit: seconds
# expr min lq mean median uq max neval
# f_csplit(myDTbig, "FORMAT", "S_1") 15.0 16.0 16 16.0 16.0 17.0 5
# f_cycle(myDTbig, "FORMAT", "S_1") 4.9 4.9 6 5.7 5.8 8.5 5
最后,我测试了 format
列中是否有多个级别(即,我们必须运行多少个周期)是否会增加具有该周期的解决方案的时间:
Finally, I tested if having many levels in format
column (i.e. how many cycles we have to run) will increase the time for the solution with the cycle:
myDTbig <- myDT[sample(nrow(myDT),1e6, replace = T),]
myDTbig[, strformat:=paste0(strformat,sample(letters,1e6, replace = T)),]
length(unique(myDTbig$strformat)) # 104
myDTbig[, id:=seq_len(nrow(myDTbig))]
print(microbenchmark(
f_csplit(myDTbig),
f_cycle(myDTbig),
times=10L
), signif=2)
# Unit: seconds
# expr min lq mean median uq max neval
# f_csplit(myDTbig) 7.3 7.4 7.7 7.6 7.9 8.4 10
# f_cycle(myDTbig) 2.7 2.9 3.0 2.9 3.0 3.8 10
因此,结论是-令人惊讶的是,此任务的执行周期比其他任何操作都要好.
So, as a conclusion - surprisingly, the cycle performed better than anything else for this task.
这篇关于一轮将tstrsplit拆分到不同的列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!