Question
Factors are one of the basic data types in R. In my experience they are basically a pain, and I never use them: I always convert to character. I have the odd feeling that I'm missing something.
Are there important examples of functions that use factors as grouping variables, where the factor data type becomes necessary? Are there specific circumstances in which I should be using factors?
Recommended answer
You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is that read.table and read.csv historically set stringsAsFactors = TRUE by default (the default changed to FALSE in R 4.0.0), and most users miss this subtlety. I say factors are useful because model fitting packages like lme4 use factors and ordered factors to fit models differently and to determine the type of contrasts to use. Graphing packages also use them for grouping. ggplot and most model fitting functions coerce character vectors to factors, so the result is the same. However, you can end up with warnings in your code:
lm(Petal.Length ~ -1 + Species, data=iris)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
# Warning: variable Species converted to a factor
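As a small check on two claims above, the sketch below reads the same inline CSV with stringsAsFactors set both ways, then compares the default contrasts for an unordered and an ordered factor. The data and the names (csv_text, size_unordered, size_ordered) are made up for illustration, and the contrast behavior shown assumes the default value of options("contrasts").

```r
# Hypothetical inline data, used only for illustration
csv_text <- "id,group\n1,a\n2,b\n3,a"

# Pass stringsAsFactors explicitly so the result does not depend on the R version
df_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
df_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(df_chr$group)   # "character"
class(df_fct$group)   # "factor"
levels(df_fct$group)  # "a" "b"

# The same values as an unordered and as an ordered factor
size_levels <- c("small", "medium", "large")
size_unordered <- factor(size_levels, levels = size_levels)
size_ordered   <- factor(size_levels, levels = size_levels, ordered = TRUE)

# With default options, unordered factors get treatment (dummy) contrasts
# and ordered factors get polynomial contrasts
contrasts(size_unordered)  # columns "medium" and "large"
contrasts(size_ordered)    # columns ".L" and ".Q"
```

This is why a model formula can treat the same labels differently depending on whether the column is a factor or an ordered factor.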
One tricky thing is the whole drop=TRUE bit. For vectors, it works well to remove factor levels that aren't in the data. For example:
s <- iris$Species
s[s == 'setosa', drop=TRUE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
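To see why the leftover levels matter in practice, here is a minimal sketch comparing the two subsets with nlevels() and table(). The variable names kept and dropped are only illustrative.

```r
s <- iris$Species

kept    <- s[s == "setosa"]               # drop = FALSE is the default
dropped <- s[s == "setosa", drop = TRUE]

nlevels(kept)     # 3
nlevels(dropped)  # 1

# table() reports every level, including ones with no observations
table(kept)     # setosa 50, versicolor 0, virginica 0
table(dropped)  # setosa 50
```

The zero-count entries are the same unused levels that later show up in summaries and plot legends.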
However, with data.frames, the behavior of [.data.frame() is different: see this email or ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:
x <- subset(iris, Species == 'setosa', drop=TRUE) # subsetting with [ behaves the same way
x$Species
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
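What drop does control in [.data.frame is whether a single-column result is collapsed to a plain vector. The sketch below, with illustrative names as_df and as_vec, shows that the unused levels stay attached either way.

```r
setosa_rows <- iris$Species == "setosa"

# drop = FALSE keeps a one-column data.frame; drop = TRUE collapses it to a vector
as_df  <- iris[setosa_rows, "Species", drop = FALSE]
as_vec <- iris[setosa_rows, "Species", drop = TRUE]

class(as_df)     # "data.frame"
class(as_vec)    # "factor"
nlevels(as_vec)  # 3: the unused levels are still attached
```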
Luckily, you can easily drop unused factor levels with droplevels(), either for an individual factor or for every factor in a data.frame (since R 2.12):
x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa" "versicolor" "virginica"
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"
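For completeness, a short sketch of droplevels() on a single factor, plus the except argument of the data.frame method for protecting chosen columns. The size column and its "huge" level are hypothetical, added only to give the data.frame a second factor with an unused level.

```r
s <- iris$Species
setosa_only <- s[s == "setosa"]

levels(droplevels(setosa_only))  # "setosa"
levels(factor(setosa_only))      # "setosa": re-creating the factor also drops unused levels

# Hypothetical second factor column with an unused level ("huge")
x <- subset(iris, Species == "setosa")
x$size <- factor(ifelse(x$Petal.Length > 1.5, "long", "short"),
                 levels = c("short", "long", "huge"))

# except takes the indices of columns whose levels should be left alone
x2 <- droplevels(x, except = which(names(x) == "Species"))
levels(x2$Species)  # "setosa" "versicolor" "virginica"
levels(x2$size)     # "short" "long"
```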
This is how to keep levels you've filtered out from showing up in ggplot legends.
Internally, factors are integers with a character vector of levels stored as an attribute (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name while using character strings, it would be a much less efficient operation, since every matching element has to be rewritten rather than a single entry in the levels attribute. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
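The points about internal storage and renaming can be sketched with a small made-up factor (the values "low", "mid", "high" are illustrative only):

```r
f <- factor(c("low", "high", "low", "mid"))

levels(f)      # "high" "low" "mid" (sorted by default)
as.integer(f)  # 2 1 2 3: the integer codes that are actually stored

# Renaming a level is a single assignment on the levels attribute
levels(f)[levels(f) == "low"] <- "Low"
as.character(f)  # "Low" "high" "Low" "mid"

# With a character vector, editing one element quietly creates a new category
ch <- c("low", "high", "low", "mid")
ch[1] <- "Low"
unique(ch)  # "Low" "high" "low" "mid"

# A factor rejects values outside its levels: the element becomes NA, with a warning
g <- factor(c("low", "high"))
g[1] <- "LOW"
is.na(g[1])  # TRUE
```

The last case is the safety net a character vector does not give you: a typo in a factor assignment is flagged instead of silently becoming a new group.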