Reading SAS sas7bdat data into R

Question

What options are there for reading files in the native SAS format, sas7bdat, into R?

The NCES Common Core, for example, contains an extensive repository of data files saved in this format. For concreteness, let's focus on reading this file from the LEA Universe for 1997-98, which contains education-agency-level demographics for entities in all states whose names begin with A through I.

What's the simplest way to bring this data into my R environment? I don't have any version of SAS available and am not willing to pay for one, so simply converting the file to .csv would be a hassle.

Recommended answer

The sas7bdat package worked fine for all but one of the files I was looking at (specifically, this one). When I reported the error to the sas7bdat developer, Matthew Shotwell, he pointed me to Hadley's haven package, which has its own function for this, read_sas.
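
For reference, a minimal sketch of the haven route, assuming the haven package is installed and the file sits in the working directory. The wrapper name `read_sas_file` is hypothetical, introduced here only to fail early with a clear message; `haven::read_sas` is the function mentioned above.

```r
# Hypothetical helper (not part of haven): check the inputs first,
# then delegate to haven::read_sas.
read_sas_file <- function(path) {
  if (!file.exists(path)) {
    stop("file not found: ", path)
  }
  if (!requireNamespace("haven", quietly = TRUE)) {
    stop("the haven package is required: install.packages('haven')")
  }
  haven::read_sas(path)
}

# Usage, assuming the file from the question is in the working directory:
# psu <- read_sas_file("psu97ai.sas7bdat")
# str(psu)
```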

read_sas is the better choice for two reasons:

1) It didn't have any trouble reading the above-linked file.
2) It is much (I'm talking much) faster than read.sas7bdat. Here's a quick benchmark (on this file, which is smaller than the others) for evidence:

library(microbenchmark)
library(sas7bdat)
library(haven)
microbenchmark(times=10L,
               read.sas7bdat("psu97ai.sas7bdat"),
               read_sas("psu97ai.sas7bdat"))

Unit: milliseconds
                              expr        min         lq       mean     median         uq        max neval cld
 read.sas7bdat("psu97ai.sas7bdat") 66696.2955 67587.7061 71939.7025 68331.9600 77225.1979 82836.8152    10   b
      read_sas("psu97ai.sas7bdat")   397.9955   402.2627   410.4015   408.5038   418.1059   425.2762    10  a

That's right: on average, haven::read_sas takes about 99.4% less time than sas7bdat::read.sas7bdat (roughly 0.41 seconds versus 72 seconds per read).
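
The reduction can be recomputed directly from the mean column of the benchmark output; the variable names below are just for illustration.

```r
# Mean timings in milliseconds, copied from the benchmark table
mean_sas7bdat <- 71939.7025
mean_haven    <- 410.4015

# Percentage of time saved, and the corresponding speed-up factor
pct_less <- 100 * (1 - mean_haven / mean_sas7bdat)
speedup  <- mean_sas7bdat / mean_haven

round(pct_less, 1)  # 99.4
round(speedup)      # 175
```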

I previously wasn't able to figure out whether the two methods produce the same data (i.e., whether they read the file with equal fidelity), but I have finally checked:

library(data.table)

# Keep as data.tables
sas7bdat <- setDT(read.sas7bdat("psu97ai.sas7bdat"))
haven <- setDT(read_sas("psu97ai.sas7bdat"))

# read.sas7bdat prefers strings as factors,
#   and as of now has no stringsAsFactors argument
#   with which to prevent this
idj_factor <- names(which(sapply(sas7bdat, is.factor)))

# Reset all factor columns as characters
sas7bdat[ , (idj_factor) := lapply(.SD, as.character), .SDcols = idj_factor]

# Check equality of the tables
all.equal(sas7bdat, haven, check.attributes = FALSE)
# [1] TRUE
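
The same kind of check can be tried on toy data without the SAS file. The sketch below uses base R only, with two made-up data frames standing in for the two readers' output: one stores strings as factors, the other as character.

```r
# Made-up stand-ins: 'a' mimics a reader that returns factors,
# 'b' one that returns character columns
a <- data.frame(id = c("x", "y"), n = c(1, 2), stringsAsFactors = TRUE)
b <- data.frame(id = c("x", "y"), n = c(1, 2), stringsAsFactors = FALSE)

# With the factor column still in place the tables do not compare equal
equal_before <- isTRUE(all.equal(a, b, check.attributes = FALSE))

# Convert every factor column to character, then compare again
is_fac <- vapply(a, is.factor, logical(1))
a[is_fac] <- lapply(a[is_fac], as.character)
equal_after <- isTRUE(all.equal(a, b, check.attributes = FALSE))

equal_after  # TRUE
```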

However, note that read.sas7bdat has kept a massive list of attributes for the file, presumably a holdover from SAS:

str(sas7bdat)
# ...
# - attr(*, "column.info")=List of 70
#   ..$ :List of 12
#   .. ..$ name  : chr "NCESSCH"
#   .. ..$ offset: int 200
#   .. ..$ length: int 12
#   .. ..$ type  : chr "character"
#   .. ..$ format: chr "$"
#   .. ..$ fhdr  : int 0
#   .. ..$ foff  : int 76
#   .. ..$ flen  : int 1
#   .. ..$ label : chr "UNIQUE SCHOOL ID (NCES ASSIGNED)"
#   .. ..$ lhdr  : int 0
#   .. ..$ loff  : int 44
#   .. ..$ llen  : int 32
# ...

So, if by any chance you need these attributes (I know some people are particularly keen on the labels, for instance), perhaps read.sas7bdat is the option for you after all.
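
Going by the str() output above, the labels sit in the column.info attribute, one list per column. A sketch of pulling them out into a named vector follows. The mock object only mimics that structure (its second entry and label are made up) so the snippet runs without the SAS file.

```r
# Mock of the structure shown by str(): a data frame carrying a
# "column.info" attribute with one list per column.
mock <- data.frame(NCESSCH = "A", EXAMPLE = "B", stringsAsFactors = FALSE)
attr(mock, "column.info") <- list(
  list(name = "NCESSCH", label = "UNIQUE SCHOOL ID (NCES ASSIGNED)"),
  list(name = "EXAMPLE", label = "MADE-UP LABEL FOR ILLUSTRATION")
)

# Collect the labels into a character vector named by column
info   <- attr(mock, "column.info")
labels <- vapply(info, function(ci) ci$label, character(1))
names(labels) <- vapply(info, function(ci) ci$name, character(1))

labels[["NCESSCH"]]  # "UNIQUE SCHOOL ID (NCES ASSIGNED)"
```

With a real table, the last three statements would be applied to the object returned by read.sas7bdat instead of the mock.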
