Problem description
What options does R have for reading files in the native SAS format, sas7bdat, into R?
The NCES Common Core, for example, contains an extensive repository of data files saved in this format. For concreteness, let's focus on trying to read in this file from the LEA Universe for 1997-98, which contains education-agency-level demographics for entities in all states beginning with A through I.
Here's a preview of the data in SAS:
What's the simplest way to bring this data into my R environment? I don't have any version of SAS available and am not willing to pay for one, so simply converting it to .csv would be a hassle.
Recommended answer
sas7bdat worked fine for all but one of the files I was looking at (specifically, this one). When I reported the error to the sas7bdat developer, Matthew Shotwell, he also pointed me in the direction of Hadley's haven package for R, which has its own read_sas function.
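As a minimal sketch of what a call to read_sas looks like, the snippet below reads the small iris.sas7bdat sample that ships with haven, used here only so that the example runs without downloading anything. It assumes haven is installed; for the NCES data you would pass the path to your own downloaded file (e.g. "psu97ai.sas7bdat") instead.

```r
# Assumes the haven package is installed: install.packages("haven")
library(haven)

# Sample file bundled with haven, used only so the snippet is self-contained;
# swap in the path to your own .sas7bdat file
path <- system.file("examples", "iris.sas7bdat", package = "haven")

dat <- read_sas(path)

# read_sas returns a tibble, which is also a data.frame
class(dat)
dim(dat)
head(dat)
```

Because the result inherits from data.frame, it can be passed straight to most base R and data.table functions.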
This method is superior for two reasons:
1) It didn't have any trouble reading the above-linked file.
2) It is much (I'm talking much) faster than read.sas7bdat.
Here's a quick benchmark (on this file, which is smaller than the others) as evidence:
library(sas7bdat)
library(haven)
library(microbenchmark)

# Time 10 runs of each reader on the same file
microbenchmark(times = 10L,
               read.sas7bdat("psu97ai.sas7bdat"),
               read_sas("psu97ai.sas7bdat"))
Unit: milliseconds
                              expr        min         lq       mean     median         uq        max neval cld
 read.sas7bdat("psu97ai.sas7bdat") 66696.2955 67587.7061 71939.7025 68331.9600 77225.1979 82836.8152    10   b
      read_sas("psu97ai.sas7bdat")   397.9955   402.2627   410.4015   408.5038   418.1059   425.2762    10  a
That's right: haven::read_sas takes (on average) about 99.4% less time than sas7bdat::read.sas7bdat, roughly 0.4 seconds against 72 seconds per read.
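For anyone who wants to check the arithmetic, the relative saving can be recomputed from the mean column of the benchmark output. This is just a sketch using the two mean timings reported above; the variable names are my own.

```r
# Mean timings in milliseconds, copied from the benchmark output above
mean_sas7bdat <- 71939.7025
mean_haven    <- 410.4015

# Share of time saved by read_sas, as a percentage
pct_less_time <- 100 * (1 - mean_haven / mean_sas7bdat)

# The same comparison expressed as a speed-up factor
speedup <- mean_sas7bdat / mean_haven

round(pct_less_time, 1)
round(speedup)
```

The result is a saving of a little over 99%, close to the figure quoted above, or a speed-up of well over 100 times on this file.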
I previously wasn't able to figure out whether the two methods produce the same data (i.e., whether both read the data with equal fidelity), but have finally done so:
library(data.table)

# Keep as data.tables
sas7bdat <- setDT(read.sas7bdat("psu97ai.sas7bdat"))
haven <- setDT(read_sas("psu97ai.sas7bdat"))

# read.sas7bdat prefers strings as factors,
#   and as of now has no stringsAsFactors argument
#   with which to prevent this, so find the factor
#   columns in its output by name
idj_factor <- names(which(sapply(sas7bdat, is.factor)))

# Reset all factor columns as characters
sas7bdat[ , (idj_factor) := lapply(.SD, as.character), .SDcols = idj_factor]

# Check equality of the tables
all.equal(sas7bdat, haven, check.attributes = FALSE)
# [1] TRUE
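The same check can be tried without the SAS file at all. Below is a small base R sketch on made-up toy data (the data frames from_sas7bdat and from_haven are hypothetical stand-ins for the two readers' output): one copy stores its strings as factors, the other as characters, and the factor columns are converted before comparing.

```r
# Toy stand-ins: one table with strings as factors, one with characters
from_sas7bdat <- data.frame(id = factor(c("a", "b", "c")), n = 1:3)
from_haven    <- data.frame(id = c("a", "b", "c"), n = 1:3,
                            stringsAsFactors = FALSE)

# Locate factor columns in the table that has them
is_fac <- vapply(from_sas7bdat, is.factor, logical(1))

# Convert only those columns to character
from_sas7bdat[is_fac] <- lapply(from_sas7bdat[is_fac], as.character)

# Compare values only, ignoring attributes
same <- isTRUE(all.equal(from_sas7bdat, from_haven, check.attributes = FALSE))
same
```

Wrapping all.equal in isTRUE matters because, when the objects differ, all.equal returns a character description of the differences rather than FALSE.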
However, note that read.sas7bdat has kept a massive list of attributes for the file, presumably a holdover from SAS:
str(sas7bdat)
# ...
# - attr(*, "column.info")=List of 70
# ..$ :List of 12
# .. ..$ name : chr "NCESSCH"
# .. ..$ offset: int 200
# .. ..$ length: int 12
# .. ..$ type : chr "character"
# .. ..$ format: chr "$"
# .. ..$ fhdr : int 0
# .. ..$ foff : int 76
# .. ..$ flen : int 1
# .. ..$ label : chr "UNIQUE SCHOOL ID (NCES ASSIGNED)"
# .. ..$ lhdr : int 0
# .. ..$ loff : int 44
# .. ..$ llen : int 32
# ...
So, if by any chance you need these attributes (I know some people are particularly keen on the labels, for instance), perhaps read.sas7bdat is the option for you after all.
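If the labels are what you are after, they can be pulled out of the "column.info" attribute shown in the str() output. The sketch below assumes the attribute has the structure printed above (a list with one entry per column, each holding name and label elements) and uses a hand-built mock object so it runs without the file; the second column and its label are invented for illustration.

```r
# Mock object mimicking the structure shown by str() above;
# with real data this would be the result of read.sas7bdat()
dat <- data.frame(NCESSCH = "010000100001", SCHNAM = "EXAMPLE SCHOOL",
                  stringsAsFactors = FALSE)
attr(dat, "column.info") <- list(
  list(name = "NCESSCH", label = "UNIQUE SCHOOL ID (NCES ASSIGNED)"),
  list(name = "SCHNAM",  label = "HYPOTHETICAL LABEL FOR ILLUSTRATION")
)

col_info <- attr(dat, "column.info")

# Build a named character vector: column name -> label
# (NA when an entry carries no label)
labels <- vapply(
  col_info,
  function(ci) if (is.null(ci$label)) NA_character_ else ci$label,
  character(1)
)
names(labels) <- vapply(col_info, function(ci) ci$name, character(1))

labels[["NCESSCH"]]
```

Keeping the labels in a separate named vector means they survive later steps, such as setDT or subsetting, that may drop attributes from the table itself.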