Parsing XML into a data frame

Problem description

I'm new to XML databases, so I'll try to explain my problem.

There is a database stored in an XML file on a Mexican government page that I'm trying to download for my analysis.

The data can be found on this page:

https://datos.gob.mx/busca/dataset/estaciones-de-servicio-gasolineras-y-precios-comerciales-de-gasolina-y-diesel

This is the direct download link. I think it points to some kind of external repository, but honestly I don't know.

https://publicacionexterna.azurewebsites.net/publicaciones/prices

If you click the link above, the database is downloaded automatically in XML format.

The database contains Mexican gas prices from retail sellers across the country, along with each seller's location in decimal degrees.

I can download the database, paste it into a Windows .xls file, save that as a .csv file, and then upload it to my R environment for analysis.

The general problem is that when I try to download it directly from the page into my R environment, I can't get a structured format that lets me perform the analysis.

I get duplicate rows and can't extract all the attributes for each level of the data.

This is the script I was able to write on my own, with some help from the internet.

# CRE FILES

library(easypackages)

my_packages <- c("rlist", "readr", "tidyverse", "lubridate", "stringr",
                 "rebus", "stringi", "purrr", "geosphere", "XML", "RCurl", "plyr")

libraries(my_packages)

# Download link for the documents
link1 <- "https://publicacionexterna.azurewebsites.net/publicaciones/prices"

# First we load the XML file into the environment
data_prices <- getURL(link1)

xmlfile <- xmlParse(data_prices)

class(xmlfile)

xmltop <- xmlRoot(xmlfile)

base <- ldply(xmlToList(xmltop), data.frame)

The problem is that I would like the date as another column, not as a row. Thank you for your answers.

Recommended answer

Something like this should get you a data frame with all of the data in separate columns.

library(RCurl)
library(XML)

# Set link to website
link1 <- "https://publicacionexterna.azurewebsites.net/publicaciones/prices"

# Get data from webpage
data_prices <- getURL(link1)

# Parse XML data
xmlfile <- xmlParse(data_prices)

# Get place nodes
places <- getNodeSet(xmlfile, "//place")

# Get values for each place
values <- lapply(places, function(x) {
  # Get the attributes of the current place (its id)
  pid <- xmlAttrs(x)

  # Get values for each gas type for current place
  newrows <- lapply(xmlChildren(x), function(y) {
    # Get type and update time values
    attrs <- xmlAttrs(y)

    # Get price value
    price <- xmlValue(y)
    names(price) <- "price"

    # Return values
    c(pid, attrs, price)
  })

  # Combine the rows for this place into a single matrix
  do.call(rbind, newrows)
})

# Combine all values into a single data frame
df <- as.data.frame(do.call(rbind, values), stringsAsFactors = FALSE)

# Reset row names for data frame
row.names(df) <- seq_len(nrow(df))
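
If you want to try the idea without hitting the live feed, here is a minimal sketch on an inline sample. The sample XML is made up: the element and attribute names (`place_id`, `gas_price`, `type`, `update_time`) and all values are assumptions about the feed layout, not copied from it. It is a slight variation on the code above: it selects the price nodes directly with XPath, reads the place id from each node's parent, and converts the price to a number at the end.

```r
library(XML)

# Hypothetical sample; names and values are assumptions, not real feed data
sample_xml <- '
<places>
  <place place_id="1001">
    <gas_price type="regular" update_time="2018-01-01T10:00:00">16.50</gas_price>
    <gas_price type="premium" update_time="2018-01-01T10:00:00">18.20</gas_price>
  </place>
  <place place_id="1002">
    <gas_price type="diesel" update_time="2018-01-02T09:30:00">17.10</gas_price>
  </place>
</places>'

# Parse the string as XML
doc <- xmlParse(sample_xml, asText = TRUE)

# Select every element child of every <place>
price_nodes <- getNodeSet(doc, "//place/*")

# Build one named character vector per price node:
# parent attributes, the node's own attributes, then the price text
rows <- lapply(price_nodes, function(node) {
  c(xmlAttrs(xmlParent(node)), xmlAttrs(node), price = xmlValue(node))
})

# Stack the vectors into a data frame, one row per price
prices <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# Everything comes back as text, so convert the price explicitly
prices$price <- as.numeric(prices$price)

print(prices)
```

Selecting `//place/*` skips any whitespace text nodes between elements, so the same approach should work whether or not the file is indented. The update time stays a character column here; something like `as.POSIXct()` or `lubridate` could convert it once you know the exact format the feed uses.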
