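I want to do something similar to the solution in another thread: I have two data frames, and I want to find the regions that overlap and then merge the corresponding data for each match.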
>x1
chr start stop CN
1 1 10 140 G
2 1 100 1000 G
3 1 1500 5000 L
>x2
chr start stop gene
1 1 1 100 a
2 1 100 150 b
3 1 190 1000 c
4 1 1000 2000 d
5 1 2000 5000 e
I can find the overlapping regions with the following code:
library(GenomicRanges)
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))
gr2 = with(x2, GRanges(chr, IRanges(start=start, end=stop)))
hits = findOverlaps(gr1, gr2)
hits shows which regions in x1 overlap which regions in x2, for example:
> hits
Hits of length 8
queryLength: 3
subjectLength: 5
queryHits subjectHits
<integer> <integer>
1 1 1
2 1 2
3 2 1
4 2 2
5 2 3
6 2 4
7 3 4
8 3 5
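What I would like is an output that contains the CN information from x1 and the gene information from x2 for each overlap. The output would look like this: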
x1chr x1start x1stop x1CN x2chr x2start x2stop x2gene
1 1 10 140 g 1 1 100 a
2 1 10 140 g 1 100 150 b
3 1 100 1000 g 1 1 100 a
4 1 100 1000 g 1 100 150 b
5 1 100 1000 g 1 190 1000 c
6 1 100 1000 g 1 1000 2000 d
7 1 1500 5000 l 1 1000 2000 d
8 1 1500 5000 l 1 2000 5000 e
Best answer
You can use foverlaps from the data.table package:
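library(data.table)
# Convert both data frames to data.tables by reference and key them on the interval columns
setkey(setDT(x1), start, stop)
setkey(setDT(x2), start, stop)
# Overlap join: for each row of x2, return the rows of x1 whose [start, stop] interval overlaps it
foverlaps(x2, x1)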
# chr start stop CN i.chr i.start i.stop gene
#1: 1 10 140 G 1 1 100 a
#2: 1 100 1000 G 1 1 100 a
#3: 1 10 140 G 1 100 150 b
#4: 1 100 1000 G 1 100 150 b
#5: 1 100 1000 G 1 190 1000 c
#6: 1 100 1000 G 1 1000 2000 d
#7: 1 1500 5000 L 1 1000 2000 d
#8: 1 1500 5000 L 1 2000 5000 e
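Note that the keys above cover only start and stop, which is fine for the sample data because every row is on chromosome 1. With several chromosomes, one option is to put chr first in the key so that intervals are only compared within the same chromosome. The following is a minimal sketch of that variant, not part of the original answer; the extra chromosome 2 row in x1 is hypothetical and was added only to show the effect.

```r
library(data.table)

# Data from the question, plus one hypothetical x1 row on chromosome 2
x1 <- data.table(chr   = c(1L, 1L, 1L, 2L),
                 start = c(10L, 100L, 1500L, 10L),
                 stop  = c(140L, 1000L, 5000L, 140L),
                 CN    = c("G", "G", "L", "G"))
x2 <- data.table(chr   = c(1L, 1L, 1L, 1L, 1L),
                 start = c(1L, 100L, 190L, 1000L, 2000L),
                 stop  = c(100L, 150L, 1000L, 2000L, 5000L),
                 gene  = c("a", "b", "c", "d", "e"))

# The last two key columns are treated as the interval;
# any key columns before them (here chr) must match exactly
setkey(x1, chr, start, stop)
setkey(x2, chr, start, stop)

# Overlap join restricted to rows that share the same chr
res <- foverlaps(x2, x1)
print(res)
```

With the two-column key from the answer, the chromosome 2 row of x1 would also be paired with genes a and b, because only the coordinates would be compared.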
Regarding r - overlapping genomic intervals and merging datasets, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30204672/