Problem description
I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several cases with the same character string). For each case in probes I want to find all the cases containing the same string, and then merge the values of all the corresponding cases in the second column (named genes) into a single case. For example, if I have this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1
3 cg00061679 DAZ4
4 cg00061679 DAZ4
I want to change it to this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1 DAZ4 DAZ4
Obviously there is no problem doing this for a single probe using which, and then paste with collapse:
ind <- which(olap$probes == "cg00061679")
genename <- olap[ind, 2]
genecomb <- paste(genename, collapse = " ")
But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?
Thanks in advance.
You can use tapply in base R:
data.frame(probes=unique(olap$probes),
genes=tapply(olap$genes, olap$probes, paste, collapse=" "))
Or use plyr:
library(plyr)
ddply(olap, "probes", summarize, genes = paste(genes, collapse=" "))
UPDATE
It's probably safer in the first version to do this:
tmp <- tapply(olap$genes, olap$probes, paste, collapse=" ")
data.frame(probes=names(tmp), genes=tmp)
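This guards against unique returning the probes in a different order from tapply: unique keeps the order of first appearance, whereas tapply orders its result by the sorted factor levels. Personally, I would always use ddply.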
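To make the ordering point concrete, here is a minimal base R sketch. It assumes the question's sample data with the rows deliberately reordered so that the first probe to appear is not the first in sorted order; the reordering and the as.vector call (used only to keep plain row numbers in the result) are illustrative choices, not part of the original answer.

```r
# Sample data from the question, with rows reordered for illustration
# (assumption: the original data lists cg00050873 first).
olap <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# unique() keeps the order of first appearance.
first_seen <- unique(olap$probes)

# tapply() groups by probe; its result is named by the sorted probe values.
tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")

# Taking the probe names from tmp itself keeps each probe next to its genes.
result <- data.frame(
  probes = names(tmp),
  genes  = as.vector(tmp),  # drop the array names so row names stay 1, 2, ...
  stringsAsFactors = FALSE
)

print(first_seen)
print(result)
```

With this input, first_seen and names(tmp) list the two probes in opposite orders, so pairing unique(olap$probes) with the tapply output would attach each gene string to the wrong probe. Building the probes column from names(tmp) avoids that.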