Problem description
I did a genome comparison between two bacteria with VISTA.
This tool gave me the regions of DNA sequence that are common to both bacteria, but I am most interested in knowing which CDSs are present in one bacterium and lacking in the second.
Using R, I managed to turn the VISTA output into a data.frame that holds the regions (base ranges) exclusive to the FIRST bacterium. These regions presumably contain genes (CDSs) that are lacking in the second one.
head(rango_vacio) # Regions (mapped bp) exclusive to the first bacteria
V1 V2
11552 13259
13365 13263
37168 37169
..... .....
On the other hand, I have processed a GFF file of this same bacterium to extract the CDS sequences. This data frame contains the start and the end of each CDS, along with the accession name of the corresponding protein.
head(cds_TIGR4) # A list of the cds of this bacteria
startbp endbp accession
197 1158 NP_344444
1717 2853 NP_344445
2864 3112 NP_344446
..... .... .....
IMPORTANT: The data frames "rango_vacio" and "cds_TIGR4" both use the same base-pair coordinate reference, so I can compare them directly.
Now, the answer to my question should be easy to get, because I only need to find which CDSs fall in each of the rango_vacio ranges, using the range of each CDS as the reference.
I can do it with a very complicated set of for loops, but I would like to learn whether this can be accomplished with a shorter approach.
Recommended answer
In the end, I believe I have found a method myself.
GenomicRanges did not fit my case, because one of my data frames contains the CDS ranges including strand information, while the other contains only plain ranges.
So I used the IRanges package instead.
I simplified the two data frames so that they contain the start and the end of the exclusive regions and of the CDS respectively: one is named rango and the other cds.
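library(IRanges)
# Build IRanges objects from the start/end columns of each data frame
ir_rango <- IRanges(start = rango[, 1], end = rango[, 2])
ir_cds <- IRanges(start = cds[, 1], end = cds[, 2])
# Query = CDS, subject = exclusive regions; one hit per overlapping pair
common <- findOverlaps(ir_cds, ir_rango)
# Row numbers of the overlapping CDS; unique() lists a CDS once even if it hits several regions
uniques <- unique(queryHits(common))
uniques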
uniques contains the row numbers, within ir_cds, of the CDS that overlap an exclusive region. Now I only need to extract the names of those CDS.
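That last step could look roughly like the sketch below. It assumes the cds data frame still carries the accession in a column named accession (as cds_TIGR4 does in the question). The toy coordinates and the names cds_A to cds_D are invented for illustration and are not real data.

```r
library(IRanges)

# Toy data, invented for illustration only (not taken from the real genomes)
rango <- data.frame(start = c(100, 500), end = c(200, 900))
cds <- data.frame(
  startbp   = c(10, 150, 600, 1000),
  endbp     = c(90, 300, 700, 1200),
  accession = c("cds_A", "cds_B", "cds_C", "cds_D"),
  stringsAsFactors = FALSE
)

ir_rango <- IRanges(start = rango$start, end = rango$end)
ir_cds   <- IRanges(start = cds$startbp, end = cds$endbp)

# Query = CDS, subject = exclusive regions
hits <- findOverlaps(ir_cds, ir_rango)

# Row numbers of the CDS that overlap at least one exclusive region
uniques <- unique(queryHits(hits))

# Look up the accession names for those rows
exclusive_cds <- cds$accession[uniques]
exclusive_cds
```

By default findOverlaps counts any partial overlap as a hit. If only CDS lying completely inside an exclusive region should be reported, passing type = "within" to findOverlaps may be closer to what is wanted.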