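I'd like to do something similar to the solution in this thread: I have two data frames, and I want to find the overlapping regions and then merge the corresponding data onto the matches.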

>x1
  chr start stop CN
1   1    10  140  G
2   1   100 1000  G
3   1  1500 5000  L



>x2
  chr start stop gene
1   1     1  100    a
2   1   100  150    b
3   1   190 1000    c
4   1  1000 2000    d
5   1  2000 5000    e
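
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two data frames as printed above. The column types are an assumption on my part (integer coordinates and chromosome, character labels), since the printout does not show them.

```r
# Rebuild the example data frames exactly as printed above.
# Assumption: chr/start/stop are integers, CN and gene are character.
x1 <- data.frame(
  chr   = c(1L, 1L, 1L),
  start = c(10L, 100L, 1500L),
  stop  = c(140L, 1000L, 5000L),
  CN    = c("G", "G", "L"),
  stringsAsFactors = FALSE
)

x2 <- data.frame(
  chr   = rep(1L, 5),
  start = c(1L, 100L, 190L, 1000L, 2000L),
  stop  = c(100L, 150L, 1000L, 2000L, 5000L),
  gene  = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
```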

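I can find the overlapping regions with the following code:
library(GenomicRanges)
# Build one GRanges object per data frame from its chr/start/stop columns
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))
gr2 = with(x2, GRanges(chr, IRanges(start=start, end=stop)))

# Each hit pairs a row index of gr1 (query) with a row index of gr2 (subject)
hits = findOverlaps(gr1, gr2)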

The hits show which regions in x1 overlap which regions in x2, for example:
> hits
Hits of length 8
queryLength: 3
subjectLength: 5
  queryHits subjectHits
   <integer>   <integer>
 1         1           1
 2         1           2
 3         2           1
 4         2           2
 5         2           3
 6         2           4
 7         3           4
 8         3           5

What I'd like is an output that contains the gene and CN information from both x1 and x2. The output would look like this:
 x1chr x1start x1stop x1CN x2chr x2start x2stop x2gene
1     1      10    140    g     1       1    100      a
2     1      10    140    g     1     100    150      b
3     1     100   1000    g     1       1    100      a
4     1     100   1000    g     1     100    150      b
5     1     100   1000    g     1     190   1000      c
6     1     100   1000    g     1    1000   2000      d
7     1    1500   5000    l     1    1000   2000      d
8     1    1500   5000    l     1    2000   5000      e

Best answer

You can use foverlaps from the data.table package:

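library(data.table)
# Convert both data frames to data.tables by reference and key them on the
# interval columns (foverlaps uses the last two key columns as the interval)
setkey(setDT(x1), start, stop)
setkey(setDT(x2), start, stop)
# For every row of x2, return the rows of x1 whose interval overlaps it
foverlaps(x2, x1)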
#   chr start stop CN i.chr i.start i.stop gene
#1:   1    10  140  G     1       1    100    a
#2:   1   100 1000  G     1       1    100    a
#3:   1    10  140  G     1     100    150    b
#4:   1   100 1000  G     1     100    150    b
#5:   1   100 1000  G     1     190   1000    c
#6:   1   100 1000  G     1    1000   2000    d
#7:   1  1500 5000  L     1    1000   2000    d
#8:   1  1500 5000  L     1    2000   5000    e
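
One thing to watch: the key above only covers start and stop, so with more than one chromosome, intervals on different chromosomes would be matched as well. A minimal sketch of one way around that is to put chr in front of the interval columns in the key, since foverlaps matches any key columns before the last two exactly. The two-chromosome tables `cn` and `genes` below are made up for illustration and are not from the question.

```r
library(data.table)

# Hypothetical two-chromosome data (NOT from the question): the intervals
# overlap by coordinate on both chromosomes.
cn <- data.table(chr = c(1L, 2L), start = c(10L, 10L),
                 stop = c(140L, 140L), CN = c("G", "L"))
genes <- data.table(chr = c(1L, 2L), start = c(1L, 100L),
                    stop = c(100L, 150L), gene = c("a", "b"))

# The last two key columns are the interval; the key column before them
# (chr) has to match exactly for two rows to be paired.
setkey(cn, chr, start, stop)
setkey(genes, chr, start, stop)

# Gene "a" is only paired with the chr 1 segment, gene "b" only with chr 2.
res <- foverlaps(genes, cn)
print(res)
```

Keyed on start and stop alone, the same toy data would also pair gene "a" with the chr 2 segment and gene "b" with the chr 1 segment.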

Regarding r - overlapping genomic intervals and merging data sets, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30204672/
