背景:我正在注释生物体中来自GWAS的SNP,而没有太多注释。我正在使用UCSC的链式tBLASTn表以及biomaRt将每个SNP映射到可能的基因。

我有一个看起来像这样的数据框:

            SNP   hu_mRNA     gene
 chr1.111642529 NM_002107    H3F3A
 chr1.111642529 NM_005324    H3F3B
 chr1.111801684 BC098118     <NA>
 chr1.111925084 NM_020435    GJC2
  chr1.11801605 AK027740     <NA>
  chr1.11801605 NM_032849    C13orf33
 chr1.151220354 NM_018913    PCDHGA10
 chr1.151220354 NM_018918    PCDHGA5

最后,我想为每个SNP插入一行,并用逗号分隔基因和hu_mRNA。这是我的追求:
            SNP            hu_mRNA    gene
 chr1.111642529 NM_002107,NM_005324   H3F3A
 chr1.111801684  BC098118,NM_020435   GJC2
  chr1.11801605  AK027740,NM_032849   C13orf33
 chr1.151220354 NM_018913,NM_018918   PCDHGA10,PCDHGA5

现在,我知道可以在perl中轻拂一下腕部就可以做到这一点,但是我真的很想在R中做到这一点。

最佳答案

您可以将aggregatepaste一起使用,最后使用merge:

x <- structure(list(SNP = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L,
5L), .Label = c("chr1.111642529", "chr1.111801684", "chr1.111925084",
"chr1.11801605", "chr1.151220354"), class = "factor"), hu_mRNA = structure(c(3L,
4L, 2L, 7L, 1L, 8L, 5L, 6L), .Label = c("AK027740", "BC098118",
"NM_002107", "NM_005324", "NM_018913", "NM_018918", "NM_020435",
"NM_032849"), class = "factor"), gene = structure(c(4L, 5L, 1L,
3L, 1L, 2L, 6L, 7L), .Label = c("<NA>", "C13orf33", "GJC2", "H3F3A",
"H3F3B", "PCDHGA10", "PCDHGA5"), class = "factor")), .Names = c("SNP",
"hu_mRNA", "gene"), class = "data.frame", row.names = c(NA, -8L
))

a1 <- aggregate(hu_mRNA~SNP,data=x,paste,sep=",")
a2 <- aggregate(gene~SNP,data=x,paste,sep=",")
merge(a1,a2)
             SNP              hu_mRNA              gene
1 chr1.111642529 NM_002107, NM_005324      H3F3A, H3F3B
2 chr1.111801684             BC098118              <NA>
3 chr1.111925084            NM_020435              GJC2
4  chr1.11801605  AK027740, NM_032849    <NA>, C13orf33
5 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5

关于r - 连接几列以按组逗号分隔字符串,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/8083984/

10-12 17:11