Comparative genomics: how to compare sequence ranges

Problem description

I did a genome comparison between two bacteria with VISTA.

This tool gave me the regions of DNA sequence that are common to the two bacteria, but I am most interested in knowing which CDSs are present in one bacterium and lacking in the other.

Using R, I turned the VISTA output into a data.frame with the regions (ranges) of bases that are exclusive to the FIRST bacterium. These regions presumably contain genes (CDSs) that are lacking in the second one.

head(rango_vacio)  # Regions (mapped bp) exclusive to the first bacteria
   V1      V2
11552   13259
13365   13263
37168   37169
.....   .....

On the other hand, I have processed a gff file of this same bacterium to extract the CDS sequences. This data frame contains the start and end of each CDS, along with the accession name of the corresponding protein.

head(cds_TIGR4) # A list of the cds of this bacteria
startbp   endbp   accession
197       1158    NP_344444
1717      2853    NP_344445
2864      3112    NP_344446
.....     ....    .....

IMPORTANT: The data frames "rango_vacio" and "cds_TIGR4" both use the same reference coordinates, base by base, so I can compare them directly.

Now, the answer to my question should be easy to get: I only need to find which CDSs fall in each of the rango_vacio ranges, using the range of each CDS as the reference.

I can do it with a very complicated set of for loops, but I would like to learn whether this can be accomplished with a shorter approach.

Recommended answer

In the end, I believe I have found the method myself.

GenomicRanges cannot be used in my case because one of my data frames contains the CDS ranges including strandedness, while the other contains only a range.

So I used the IRanges package instead.

I simplified the two data frames to contain just the start and end of the exclusive ranges and of the CDSs, one named rango and the other cds.
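Before building the ranges, it may be worth checking that every row has its start at or before its end, because IRanges() is expected to reject ranges with a negative width, and the second row of the sample rango_vacio output in the question (13365, 13263) appears to be reversed. The sketch below uses base R only. The toy data frame copies the three rows shown in the question, and swapping the endpoints is just one possible repair, an assumption about what such a row is meant to say.

```r
# Toy data copied from the head(rango_vacio) output shown in the question
rango <- data.frame(V1 = c(11552, 13365, 37168),
                    V2 = c(13259, 13263, 37169))

# Row numbers whose start is greater than their end
reversed <- which(rango$V1 > rango$V2)

# One possible repair (an assumption): order the endpoints of every row
rango_fixed <- data.frame(start = pmin(rango$V1, rango$V2),
                          end   = pmax(rango$V1, rango$V2))
```

Whether a reversed row should be swapped or dropped depends on how the VISTA output was parsed, so inspecting the rows flagged in reversed first is probably the safer choice.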

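library(IRanges)
# Exclusive regions and CDS coordinates as integer ranges
ir_rango <- IRanges(rango[,1], rango[,2])
ir_cds <- IRanges(cds[,1], cds[,2])
# Each hit pairs a CDS (query) with an exclusive region (subject) it overlaps
common <- findOverlaps(ir_cds, ir_rango)
# Row numbers in cds, each listed once even if it overlaps several regions
uniques <- unique(queryHits(common))
uniques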

uniques contains the row numbers of the corresponding ranges in ir_cds. Now I only need to extract the names of the CDSs.
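As a minimal sketch of that last step, the row numbers can index the accession column directly. The toy data frames and the accession labels NP_A, NP_B and NP_C below are hypothetical, made up only to have something runnable. The example also contrasts the default type = "any" of findOverlaps, which reports any partial overlap, with type = "within", which keeps only CDSs lying entirely inside an exclusive region. Which of the two matches the biological question is a judgment call.

```r
library(IRanges)

# Hypothetical toy data shaped like the data frames in the question
rango <- data.frame(start = c(100, 2000), end = c(1200, 2500))
cds <- data.frame(startbp   = c(197, 1717, 2100),
                  endbp     = c(1158, 2853, 2300),
                  accession = c("NP_A", "NP_B", "NP_C"),
                  stringsAsFactors = FALSE)

ir_rango <- IRanges(rango$start, rango$end)
ir_cds   <- IRanges(cds$startbp, cds$endbp)

# Default type = "any": CDSs that overlap an exclusive region at all
hits_any <- findOverlaps(ir_cds, ir_rango)
cds_any  <- cds$accession[unique(queryHits(hits_any))]

# type = "within": CDSs that lie entirely inside an exclusive region
hits_within <- findOverlaps(ir_cds, ir_rango, type = "within")
cds_within  <- cds$accession[unique(queryHits(hits_within))]
```

With the real data, the same indexing would be applied to the accession column of cds_TIGR4, provided its rows are in the same order as the rows used to build ir_cds.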
