Problem description
I have a data.frame1 like:
1 bin chrom chromStart chromEnd name score
2 12 chr1 29123222 29454711 -5.7648 599
3 116 chr1 45799118 45986770 -4.8403 473
4 117 chr1 46327104 46490961 -5.3036 536
5 121 chr1 50780759 51008404 -4.4165 415
6 133 chr1 63634657 63864734 -4.8096 469
7 147 chr1 77825305 78062178 -5.4671 559
I also have a data.frame2 like:
chrom chromStart chromEnd N
1 chr1 63600000 63700000 1566
2 chr1 45800000 45900000 1566
3 chr1 29100000 29400000 1566
4 chr1 50400000 50500000 1566
5 chr1 46500000 46600000 1566
Basically, data.frame1 holds ranges of values running from chromStart to chromEnd. I want to cut those ranges down to only the parts that overlap with the ranges in data.frame2. For example, the first range of df1 is 29123222 to 29454711. I would like to cut that range down to 29123222 to 29400000, because that is the only part that overlaps with a range from df2. Does anyone know how I could do this?
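The clipping rule described above can be sketched for a single pair of ranges. This is only a minimal illustration in base R, assuming closed ranges (both endpoints included); clip_range is a hypothetical helper name, not something from the original post or from any package.

```r
# Hypothetical helper (not from the original post): clip the closed range
# [start1, end1] to its intersection with [start2, end2].
clip_range <- function(start1, end1, start2, end2) {
  # Two closed ranges overlap when each one starts before the other ends.
  if (start1 > end2 || start2 > end1) {
    return(NULL)  # no overlap, so there is nothing to keep
  }
  # The intersection runs from the later start to the earlier end.
  c(start = max(start1, start2), end = min(end1, end2))
}

# First range of data.frame1 against the third range of data.frame2.
clipped <- clip_range(29123222, 29454711, 29100000, 29400000)
print(clipped)

# Third range of data.frame1 against the fifth range of data.frame2:
# 46490961 is below 46500000, so the two ranges do not overlap.
no_hit <- clip_range(46327104, 46490961, 46500000, 46600000)
```

Applying this rule to every pair of rows that shares a chromosome gives the desired table; rows of data.frame1 with no overlapping partner drop out.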
The output I want is a data.frame like:
1 bin chrom chromStart chromEnd name score
2 12 chr1 29123222 29400000 -5.7648 599
3 116 chr1 45800000 45900000 -4.8403 473
6 133 chr1 63634657 63700000 -4.8096 469
Here is the data.frame that my current attempt gives me:
chrom chromStart chromEnd bin name score
1 chr1 29123222 29130000 12 -5.7648 599
2 chr1 29123222 29140000 12 -5.7648 599
3 chr1 29123222 29150000 12 -5.7648 599
4 chr1 29123222 29160000 12 -5.7648 599
5 chr1 29123222 29170000 12 -5.7648 599
Recommended answer
+1 for suggesting IRanges::findOverlaps.
Solution using findOverlaps and GenomicRanges:
library(GenomicRanges)

# Example data from the question
df1 <- data.frame(
    bin = c(12, 116, 117, 121, 133, 147),
    chrom = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
    chromStart = c(29123222, 45799118, 46327104, 50780759, 63634657, 77825305),
    chromEnd = c(29454711, 45986770, 46490961, 51008404, 63864734, 78062178),
    name = c(-5.7648, -4.8403, -5.3036, -4.4165, -4.8096, -5.4671),
    score = c(599, 473, 536, 415, 469, 559))
df2 <- data.frame(
    chrom = c("chr1", "chr1", "chr1", "chr1", "chr1"),
    chromStart = c(63600000, 45800000, 29100000, 50400000, 46500000),
    chromEnd = c(63700000, 45900000, 29400000, 50500000, 46600000),
    N = c(1566, 1566, 1566, 1566, 1566))

# Make GRanges objects from the data frames
gr1 <- with(df1, GRanges(
    chrom,
    IRanges(start = chromStart, end = chromEnd),
    bin = bin,
    name = name,
    score = score))
gr2 <- with(df2, GRanges(
    chrom,
    IRanges(start = chromStart, end = chromEnd),
    N = N))

# Get overlapping features
hits <- findOverlaps(query = gr1, subject = gr2)

# Get features from gr1 that overlap with features from gr2
idx1 <- queryHits(hits)
idx2 <- subjectHits(hits)
gr <- gr1[idx1]

# Keep only the intersecting part of each range:
# the later of the two starts and the earlier of the two ends
start(gr) <- pmax(start(gr), start(gr2[idx2]))
end(gr) <- pmin(end(gr), end(gr2[idx2]))
print(gr)
GRanges object with 3 ranges and 3 metadata columns:
      seqnames               ranges strand |       bin      name     score
         <Rle>            <IRanges>  <Rle> | <numeric> <numeric> <numeric>
  [1]     chr1 [29123222, 29400000]      * |        12   -5.7648       599
  [2]     chr1 [45800000, 45900000]      * |       116   -4.8403       473
  [3]     chr1 [63634657, 63700000]      * |       133   -4.8096       469
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Turn the GRanges object back into a data frame
df <- data.frame(
    bin = mcols(gr)$bin,
    chrom = seqnames(gr),
    chromStart = start(gr),
    chromEnd = end(gr),
    name = mcols(gr)$name,
    score = mcols(gr)$score)
print(df)
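If Bioconductor is not available, the same clipping can be sketched in base R. This is only an illustrative alternative, not the method used in the answer. It assumes closed ranges and pairs every row of df1 with every row of df2 on the same chromosome via merge, so it can get slow or memory-hungry on large inputs, where findOverlaps is the better tool. The names candidates, overlapping and clipped are made up for this sketch.

```r
# Same example data as in the question.
df1 <- data.frame(
  bin = c(12, 116, 117, 121, 133, 147),
  chrom = "chr1",
  chromStart = c(29123222, 45799118, 46327104, 50780759, 63634657, 77825305),
  chromEnd = c(29454711, 45986770, 46490961, 51008404, 63864734, 78062178),
  name = c(-5.7648, -4.8403, -5.3036, -4.4165, -4.8096, -5.4671),
  score = c(599, 473, 536, 415, 469, 559))
df2 <- data.frame(
  chrom = "chr1",
  chromStart = c(63600000, 45800000, 29100000, 50400000, 46500000),
  chromEnd = c(63700000, 45900000, 29400000, 50500000, 46600000),
  N = 1566)

# Pair every df1 row with every df2 row on the same chromosome.
# Shared column names get the suffixes .1 (df1) and .2 (df2).
candidates <- merge(df1, df2, by = "chrom", suffixes = c(".1", ".2"))

# Keep only pairs whose closed ranges overlap.
overlapping <- candidates[
  candidates$chromStart.1 <= candidates$chromEnd.2 &
    candidates$chromStart.2 <= candidates$chromEnd.1, ]

# Clip each df1 range to the later start and the earlier end.
clipped <- data.frame(
  bin = overlapping$bin,
  chrom = overlapping$chrom,
  chromStart = pmax(overlapping$chromStart.1, overlapping$chromStart.2),
  chromEnd = pmin(overlapping$chromEnd.1, overlapping$chromEnd.2),
  name = overlapping$name,
  score = overlapping$score)

# Sort by position so the row order does not depend on how merge orders pairs.
clipped <- clipped[order(clipped$chrom, clipped$chromStart), ]
rownames(clipped) <- NULL
print(clipped)
```

As in the GenomicRanges version, a df1 range that overlaps several df2 ranges would yield one clipped row per overlap.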