Factors in R: more than an annoyance?

Question

Factors are one of the basic data types in R. In my experience, factors are basically a pain and I never use them; I always convert to characters. Still, I have the odd feeling that I'm missing something.

Are there some important examples of functions that use factors as grouping variables where the factor data type becomes necessary? Are there specific circumstances when I should be using factors?

Recommended answer

You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is that read.table and read.csv used stringsAsFactors = TRUE by default (the default became FALSE in R 4.0.0), and most users miss this subtlety. I say factors are useful because model-fitting packages like lme4 use factors and ordered factors to fit models differently and to determine the type of contrasts to use. Graphing packages also use them for grouping. ggplot and most model-fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code:

lm(Petal.Length ~ -1 + Species, data=iris)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)

# Coefficients:
#     Speciessetosa  Speciesversicolor   Speciesvirginica  
#             1.462              4.260              5.552  

iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

# Coefficients:
#     Speciessetosa  Speciesversicolor   Speciesvirginica  
#             1.462              4.260              5.552

Warning: variable Species converted to factor
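To make the stringsAsFactors point and the contrasts point concrete, here is a minimal sketch. The inline CSV text and the variable names (csv_text, d_chr, d_fct, grp_unordered, grp_ordered) are made up for illustration, and the contrast types shown assume the default options("contrasts") has not been changed.

```r
# Hypothetical inline CSV so the example is self-contained
csv_text <- "id,group\n1,low\n2,high\n3,medium\n4,low"

# Being explicit avoids depending on the default of the installed R version
d_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
d_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(d_chr$group)  # "character"
class(d_fct$group)  # "factor"

# Convert deliberately and choose the level order yourself
lvls <- c("low", "medium", "high")
grp_unordered <- factor(d_chr$group, levels = lvls)
grp_ordered   <- factor(d_chr$group, levels = lvls, ordered = TRUE)

# Under the default options("contrasts"), unordered factors get treatment
# contrasts and ordered factors get polynomial contrasts
contrasts(grp_unordered)
contrasts(grp_ordered)
```

Reading the column as character and converting it with factor() keeps the choice of levels, and of ordered versus unordered, in your hands rather than in the reader function's.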

One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:

s <- iris$Species
s[s == 'setosa', drop=TRUE]
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

However, with data.frames, the behavior of [.data.frame() is different: see this email or ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:

x <- subset(iris, Species == 'setosa', drop=TRUE)  # subsetting with [ behaves the same way
x$Species
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

Luckily, droplevels() makes it easy to drop unused factor levels, either for an individual factor or for every factor in a data.frame (since R 2.12):

x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa"     "versicolor" "virginica" 
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"

This is how to keep levels you have filtered out from showing up in ggplot legends.
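The same lingering levels can be seen without ggplot. The sketch below uses base R's table() as a stand-in for a legend, which is an assumption made only to keep the example free of extra packages.

```r
x <- subset(iris, Species == "setosa")

# Unused levels are still counted, here with zero observations:
# setosa 50, versicolor 0, virginica 0
table(x$Species)

# droplevels() also works on a single factor
table(droplevels(x$Species))

# Calling factor() on an existing factor drops unused levels as well
nlevels(factor(x$Species))  # 1
```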

Internally, factors are integer vectors with a levels attribute that holds a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name and you were using character strings, it would be a much less efficient operation, since every matching element has to be rewritten. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's a risk that you'll change just one element and accidentally create a separate new level.
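A short sketch of the internals and of the renaming risk described here. The replacement label "Setosa" and the variable names f, s and g are illustrative choices, not part of the original answer.

```r
f <- iris$Species

# Storage is integer codes plus a "levels" attribute
typeof(f)   # "integer"
levels(f)   # "setosa" "versicolor" "virginica"

# Renaming a level edits only the levels attribute, not the 150 elements
levels(f)[levels(f) == "setosa"] <- "Setosa"
levels(f)   # "Setosa" "versicolor" "virginica"

# With a character vector, editing one element quietly creates a new group
s <- as.character(iris$Species)
s[1] <- "Setosa"
unique(s)   # "Setosa" "setosa" "versicolor" "virginica"

# A factor rejects a value that is not an existing level:
# the assignment warns and stores NA instead
g <- iris$Species
g[1] <- "Setosa"
is.na(g[1])  # TRUE
```

The warning on the last assignment is what guards against the accidental extra level that a plain character vector would accept silently.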
