Question
I have a large protein database in an XML file, and I need to extract some information from it using R. The database is organized into entries, each containing information about a specific protein that I need to extract and format.
https://www.dropbox.com/s/dq8ir9f22cnfwrz/Sample.xml
I would like to extract the name, all dbReference elements of type "EC", and the sequence for each entry. So far I have:
library("XML")
doc <- xmlParse("Sample.xml")
I was thinking of either using the xpathSApply function to explicitly pick the tags to visit, or the xmlToDataFrame function. I'm new to R, so I'm a bit unsure where to begin.
Recommended answer
Just select the elements you need with getNodeSet:
# Select every entry node; bind the file's default namespace to the prefix "ns"
nd <- getNodeSet(doc, "//ns:entry", namespaces = c(ns = getDefaultNamespace(doc)[[1]]$uri))
y <- data.frame(
  id = sapply(nd, xpathSApply, './*[local-name()="name"]', xmlValue),
  # join all EC ids of an entry with "; "
  ec = sapply(nd, function(node) paste(xpathSApply(node, './/*[local-name()="dbReference" and @type="EC"]/@id'), collapse = "; ")),
  sequence = gsub("\n", "", sapply(nd, xpathSApply, './*[local-name()="sequence"]', xmlValue)))  # strip line breaks
head(y, 3)
  id           ec                                                                       sequence
1 AK1C3_HUMAN  1.-.-.-; 1.1.1.357; 1.1.1.112; 1.1.1.188; 1.1.1.239; 1.1.1.64; 1.3.1.20  MDSKHQCVKLNDGHFMPVLGFGTYAPPEVPRSKALEVTKLAIEA...
2 CP3A4_HUMAN  1.14.13.-; 1.14.13.157; 1.14.13.32; 1.14.13.67; 1.14.13.97               MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPL...
3 AK1C1_HUMAN  1.1.1.-; 1.1.1.149; 1.1.1.112; 1.3.1.20                                  MDSKYQCVKLNDGHFMPVLGFGTYAPAEVPKSKALEATKLAIEA...
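For readers without the sample file, here is a minimal, self-contained sketch of the same idea. It assumes a tiny made-up document in the UniProt namespace (the entry names, EC numbers, and sequences are invented for illustration), and it binds the namespace to a prefix instead of using local-name(). The helper first_value is a hypothetical name, not part of the XML package; it returns NA when an entry lacks the requested element.

```r
library(XML)

# Made-up miniature document (not real UniProt data), using the UniProt namespace
xml_text <- '<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>PROT1_TEST</name>
    <dbReference type="EC" id="1.1.1.1"/>
    <dbReference type="EC" id="1.1.1.2"/>
    <dbReference type="PDB" id="1ABC"/>
    <sequence>MDSK
HQCV</sequence>
  </entry>
  <entry>
    <name>PROT2_TEST</name>
    <sequence>MALI</sequence>
  </entry>
</uniprot>'

doc <- xmlParse(xml_text, asText = TRUE)
ns <- c(ns = "http://uniprot.org/uniprot")
entries <- getNodeSet(doc, "//ns:entry", namespaces = ns)

# Hypothetical helper: text of the first node matching `path`, or NA if none
first_value <- function(node, path) {
  vals <- xpathSApply(node, path, xmlValue, namespaces = ns)
  if (length(vals) == 0) NA_character_ else vals[[1]]
}

result <- data.frame(
  id = vapply(entries, first_value, character(1), path = "./ns:name"),
  # Collect the id attribute of every EC reference and join with "; "
  ec = vapply(entries, function(e) {
    ids <- xpathSApply(e, "./ns:dbReference[@type='EC']", xmlGetAttr, "id",
                       namespaces = ns)
    paste(unlist(ids), collapse = "; ")
  }, character(1)),
  # Remove line breaks and other whitespace inside the sequence text
  sequence = gsub("\\s", "", vapply(entries, first_value, character(1),
                                    path = "./ns:sequence")),
  stringsAsFactors = FALSE
)
print(result)
```

Using vapply rather than sapply is a design choice: it fails loudly if an entry yields something other than a single string, which sapply would silently turn into a list column.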
You could also drop the namespace and simplify these queries:
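x <- readLines("Sample.xml")
# Assumes line 2 holds the root tag with the xmlns declaration; replace it with a bare tag
x[2] <- "<uniprot>"
doc <- xmlParse(x)
# With no default namespace, plain element names work in XPath
nd <- getNodeSet(doc, "//entry")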
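If the root tag is not guaranteed to sit on line 2, one alternative is to remove the default namespace declaration with a regular expression before parsing. The sketch below is a variation on the approach above, not the answer's exact method, and it uses a made-up inline document so it runs without the sample file.

```r
library(XML)

# Made-up miniature document (not real UniProt data)
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><name>PROT1_TEST</name></entry>
  <entry><name>PROT2_TEST</name></entry>
</uniprot>'

# Drop the first default-namespace declaration; prefixed declarations
# such as xmlns:xsi="..." do not match this pattern and are left alone
stripped <- sub('\\s+xmlns="[^"]*"', "", xml_text)

doc <- xmlParse(stripped, asText = TRUE)

# Plain element names now work without a namespace prefix
entry_names <- xpathSApply(doc, "//entry/name", xmlValue)
print(entry_names)
```

For a file on disk, the same substitution could be applied to paste(readLines("Sample.xml"), collapse = "\n") before parsing.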
Or use UniProt's REST services instead.
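As a rough sketch of what that could look like: the URL pattern below is an assumption about UniProt's current REST API and should be checked against the UniProt documentation, the helper uniprot_xml_url is a hypothetical name, and P42330 is used only as an example accession. The download itself is left commented out because it needs network access.

```r
# Hypothetical helper: build the URL for one entry's XML record.
# The URL pattern is an assumption; verify it against the UniProt docs.
uniprot_xml_url <- function(accession) {
  sprintf("https://rest.uniprot.org/uniprotkb/%s.xml", accession)
}

url <- uniprot_xml_url("P42330")
print(url)

# Fetching requires network access, so it is not run here.
# readLines() is used for the download and the text is then parsed as a string:
# library(XML)
# txt <- paste(readLines(url, warn = FALSE), collapse = "\n")
# doc <- xmlParse(txt, asText = TRUE)
```

The parsed document could then be queried with the same getNodeSet and xpathSApply calls shown earlier.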