Problem description
fileA contains intervals (start, end) and a value assigned to each interval (value):
start end value
0 123 1 #value 1 at positions 0 to 122 included.
123 78000 0 #value 0 at positions 123 to 77999 included.
78000 78004 56 #value 56 at positions 78000, 78001, 78002 and 78003.
78004 78005 12 #value 12 at position 78004.
78005 78006 1 #value 1 at position 78005.
78006 78008 21 #value 21 at positions 78006 and 78007.
78008 78056 8 #value 8 at positions 78008 to 78055 included.
78056 81000 0 #value 0 at positions 78056 to 80999 included.
fileB contains a list of the intervals I am interested in. I would like to retrieve the overlapping intervals from fileA. The starts and ends do not necessarily match. Here is an example of fileB:
start end label
77998 78005 romeo
78007 78012 juliet
The goal is to (1) retrieve the intervals from fileA that overlap with an interval in fileB and (2) append the corresponding labels from fileB. The expected result is shown below (a # marks a line that was discarded; it is only there to help visualize and will not be in the final output):
start end value label
#
123 78000 0 romeo
78000 78004 56 romeo
78004 78005 12 romeo
#
78006 78008 21 juliet
78008 78056 8 juliet
#
Here is my attempt at writing code:
#read from tab-delimited text files which do not contain column names
A<-read.table("fileA.txt",sep="\t",colClasses=c("numeric","numeric","numeric"))
B<-read.table("fileB.txt",sep="\t",colClasses=c("numeric","numeric","character"))
#add column names
colnames(A)<-c("start","end","value")
colnames(B)<-c("start","end","label")
#output intervals in `fileA` that overlap with an interval in `fileB`
A_overlaps<-A[((A$start <= B$start & A$end >= B$start)
|(A$start >= B$start & A$start <= B$end)
|(A$end >= B$start & A$end <= B$end)),]
At this point I am already getting unexpected results:
> A_overlaps
start end value
#missing
3 78000 78004 56
5 78005 78006 1 #this line should not be here
6 78006 78008 21
#missing
I haven't written the part that outputs the labels yet because I would rather fix this first, but I can't figure out what I am getting wrong...
I also tried the following, but it just outputs the entirety of fileA:
A_overlaps <- A[(min(A$start,A$end) < max(B$start,B$end)
& max(A$start,A$end) > min(B$start,B$end)),]
Recommended answer
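As background on why the attempts in the question misbehave: A$start <= B$start compares an 8-element vector with a 2-element vector, so R recycles the values from B down the rows of A (row 1 against romeo, row 2 against juliet, row 3 against romeo, and so on) instead of testing every interval in A against every interval in B. Likewise, min() and max() collapse all of their arguments into a single number, so the second attempt yields one TRUE that selects every row. A small sketch with made-up vectors (a and b are illustrative names, not data from the question):

```r
# Illustrative vectors only; a and b are not from the question.
a <- c(1, 5, 10, 20)
b <- c(4, 15)

# Element-wise comparison recycles b as 4, 15, 4, 15,
# so each element of a is compared with only one element of b.
recycled <- a <= b

# min() and max() reduce all their arguments to a single number,
# so this condition is a single logical value, not one per row.
collapsed <- min(a) < max(b)

# outer() compares every element of a with every element of b,
# giving a length(a) x length(b) logical matrix.
pairwise <- outer(a, b, "<=")

print(recycled)
print(collapsed)
print(pairwise)
```

An all-against-all comparison, whether through outer() or nested apply() calls, is what the overlap test needs.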
The following code produces the desired output, but may be a little difficult to read:
# check whether x lies strictly between a and b
is.between <- function(x, a, b) {
(x - a) * (b - x) > 0
}
# apply to all rows in A
> matching <- apply(A, MARGIN=1, FUN=function(x){
# indices of the rows in B that fulfil the following condition:
+ which(apply(B, MARGIN=1, FUN=function(y){
# the start or the end of the interval from A lies inside the interval from B
+ is.between(as.numeric(x[1]), as.numeric(y[1]), as.numeric(y[2])) | is.between(as.numeric(x[2]), as.numeric(y[1]), as.numeric(y[2]))
+ }))
+ })
>
# print the results
> matching
[[1]]
integer(0)
[[2]]
[1] 1
[[3]]
[1] 1
[[4]]
[1] 1
[[5]]
integer(0)
[[6]]
[1] 2
[[7]]
[1] 2
[[8]]
integer(0)
>
# keep only the rows of A that have at least one match (drop zero-length results)
> A_overlaps <- A[unlist(lapply(matching, FUN=function(x)length(x)>0)),]
# add label
> A_overlaps$label <- B$label[unlist(matching)]
>
> A_overlaps
start end value label
2 123 78000 0 romeo
3 78000 78004 56 romeo
4 78004 78005 12 romeo
6 78006 78008 21 juliet
7 78008 78056 8 juliet
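One limitation worth noting: is.between() only checks whether an endpoint of the fileA interval falls strictly inside the fileB interval, so a fileA interval that completely contains a fileB interval would be missed, and B$label[unlist(matching)] assumes each fileA row matches at most one fileB row. A possible alternative sketch, assuming half-open intervals [start, end) as described in the question's comments, pairs every row of A with every row of B and keeps the pairs where A starts before B ends and A ends after B starts. The data frames are rebuilt inline so the example runs on its own; the names pairs, hits and result are illustrative.

```r
# Sample data copied from the question.
A <- data.frame(
  start = c(0, 123, 78000, 78004, 78005, 78006, 78008, 78056),
  end   = c(123, 78000, 78004, 78005, 78006, 78008, 78056, 81000),
  value = c(1, 0, 56, 12, 1, 21, 8, 0)
)
B <- data.frame(
  start = c(77998, 78007),
  end   = c(78005, 78012),
  label = c("romeo", "juliet"),
  stringsAsFactors = FALSE
)

# Cross join: by = NULL pairs every row of A with every row of B.
# Column names shared by both inputs get the suffixes (start.A, start.B, ...).
pairs <- merge(A, B, by = NULL, suffixes = c(".A", ".B"))

# Two half-open intervals overlap when each one starts before the other ends.
hits <- pairs$start.A < pairs$end.B & pairs$end.A > pairs$start.B

# Keep the columns of A plus the label, restore names, and sort by start.
result <- pairs[hits, c("start.A", "end.A", "value", "label")]
names(result) <- c("start", "end", "value", "label")
result <- result[order(result$start), ]
rownames(result) <- NULL
print(result)
```

The cross join grows as nrow(A) times nrow(B), so for large files a dedicated interval-join tool (for example foverlaps() in data.table, or the Bioconductor IRanges package) may be a better fit.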