Problem description
I want to get the coordinates of human genes from my list (HGNC gene symbols) using the GenomicFeatures and TxDb.Hsapiens.UCSC.hg19.knownGene R packages from Bioconductor.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
my_genes = c("INO80","NASP","INO80D","SMARCA1")
select(txdb, keys = my_genes,
columns=c("TXCHROM","TXSTART","TXEND","TXSTRAND"),
keytype="GENEID")
However, this doesn't work because txdb doesn't accept HGNC identifiers. How can this be solved? I couldn't find a keytype that supports HGNC, and I'm not sure how to match the HGNC IDs I have to the GENEID values in txdb.
Recommended answer
A txdb is a transcript database: it doesn't have (HGNC) gene symbols, but it does have Entrez IDs, stored as GENEID.
First, we need to map the gene symbols to Entrez IDs.
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
myGeneSymbols <- select(org.Hs.eg.db,
keys = c("INO80","NASP","INO80D","SMARCA1"),
columns = c("SYMBOL","ENTREZID"),
keytype = "SYMBOL")
# SYMBOL ENTREZID
# 1 INO80 54617
# 2 NASP 4678
# 3 INO80D 54891
# 4 SMARCA1 6594
Then we can subset the txdb using those Entrez IDs as GENEID keys:
myGeneSymbolsTx <- select(TxDb.Hsapiens.UCSC.hg19.knownGene,
keys = myGeneSymbols$ENTREZID,
columns = c("GENEID", "TXID", "TXCHROM", "TXSTART", "TXEND"),
keytype = "GENEID")
# GENEID TXID TXCHROM TXSTART TXEND
# 1 54617 55599 chr15 41267988 41280172
# 2 54617 55600 chr15 41271079 41408340
# 3 54617 55601 chr15 41271079 41408340
# 4 4678 1229 chr1 46049660 46079853
# 5 4678 1230 chr1 46049660 46081143
# 6 4678 1231 chr1 46049660 46084578
# 7 4678 1232 chr1 46049660 46084578
# 8 4678 1233 chr1 46049660 46084578
# 9 4678 1234 chr1 46067733 46075197
# 10 4678 1235 chr1 46077135 46084578
# 11 54891 12593 chr2 206858445 206950906
# 12 6594 77970 chrX 128580478 128657460
# 13 6594 77971 chrX 128580478 128657460
# 14 6594 77972 chrX 128580740 128657460
# 15 6594 77973 chrX 128580740 128657460
If required, we can then add the gene symbol to the table using merge:
res <- merge(myGeneSymbols, myGeneSymbolsTx, by.x = "ENTREZID", by.y = "GENEID")
# ENTREZID SYMBOL TXID TXCHROM TXSTART TXEND
# 1 4678 NASP 1229 chr1 46049660 46079853
# 2 4678 NASP 1230 chr1 46049660 46081143
# 3 4678 NASP 1231 chr1 46049660 46084578
# 4 4678 NASP 1232 chr1 46049660 46084578
# 5 4678 NASP 1233 chr1 46049660 46084578
# 6 4678 NASP 1234 chr1 46067733 46075197
# 7 4678 NASP 1235 chr1 46077135 46084578
# 8 54617 INO80 55599 chr15 41267988 41280172
# 9 54617 INO80 55600 chr15 41271079 41408340
# 10 54617 INO80 55601 chr15 41271079 41408340
# 11 54891 INO80D 12593 chr2 206858445 206950906
# 12 6594 SMARCA1 77970 chrX 128580478 128657460
# 13 6594 SMARCA1 77971 chrX 128580478 128657460
# 14 6594 SMARCA1 77972 chrX 128580740 128657460
# 15 6594 SMARCA1 77973 chrX 128580740 128657460
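The table above has one row per transcript, while the question asks for gene coordinates. One possible way to get a single range per gene is to take the smallest TXSTART and the largest TXEND for each symbol. The following is only a minimal sketch in base R under that assumption: it hard-codes a few rows copied from the output above so it runs without Bioconductor, and the names gene_span, CHROM, START and END are illustrative, not part of any package.

```r
# Transcript-level rows copied from the merged table above (NASP and INO80D only),
# so that this sketch runs without any Bioconductor packages installed.
res <- data.frame(
  SYMBOL  = c(rep("NASP", 7), "INO80D"),
  TXCHROM = c(rep("chr1", 7), "chr2"),
  TXSTART = c(46049660, 46049660, 46049660, 46049660, 46049660,
              46067733, 46077135, 206858445),
  TXEND   = c(46079853, 46081143, 46084578, 46084578, 46084578,
              46075197, 46084578, 206950906)
)

# Assumption: a gene's span is the union of its transcripts,
# i.e. the minimum start and the maximum end per gene and chromosome.
starts <- aggregate(TXSTART ~ SYMBOL + TXCHROM, data = res, FUN = min)
ends   <- aggregate(TXEND ~ SYMBOL + TXCHROM, data = res, FUN = max)

# Join the two summaries back into one row per gene.
gene_span <- merge(starts, ends, by = c("SYMBOL", "TXCHROM"))
names(gene_span) <- c("SYMBOL", "CHROM", "START", "END")  # illustrative names
print(gene_span)
```

With the real res data frame from the merge step, the data.frame() call can be dropped. GenomicFeatures also provides genes(), which returns gene-level ranges from a txdb directly and may be a better fit than collapsing transcripts by hand.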