Problem description
I have some data:
transaction <- c(1,2,3);
date <- c("2010-01-31","2010-02-28","2010-03-31");
type <- c("debit", "debit", "credit");
amount <- c(-500, -1000.97, 12500.81);
oldbalance <- c(5000, 4500, 17000.81)
evolution <- data.frame(transaction, date, type, amount, oldbalance, row.names=transaction, stringsAsFactors=FALSE);
evolution$date <- as.Date(evolution$date, "%Y-%m-%d");
evolution <- transform(evolution, newbalance = oldbalance + amount);
evolution
If I run the command:
type <- factor(type)
where type is a nominal (categorical) variable, what difference does it make to my data?
Thanks
Recommended answer
Factors vs character vectors when doing statistics: For statistical modelling, R treats factors and character vectors the same way, because modelling functions such as lm() convert character vectors to factors internally. In fact, it's often easier to leave categorical variables as character vectors.
If you run a regression or ANOVA with lm() using a character vector as a categorical variable, you get the normal model output, but with this message:
Warning message:
In model.matrix.default(mt, mf, contrasts) :
variable 'character_x' converted to a factor
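One way to see why the results match is to compare the design matrices that R builds for the two representations. The sketch below is illustrative only: the names animal, df_chr, df_fct, mm_chr and mm_fct are made up for this example, and suppressWarnings() is used because whether the conversion message appears may depend on your R version.

```r
# Hypothetical categorical data, stored once as character and once as factor.
animal <- rep(c("dog", "cat"), 5)
df_chr <- data.frame(x = animal, stringsAsFactors = FALSE)
df_fct <- data.frame(x = factor(animal))

# model.matrix() builds the dummy-coded design matrix that lm() uses.
mm_chr <- suppressWarnings(model.matrix(~ x, data = df_chr))
mm_fct <- model.matrix(~ x, data = df_fct)

colnames(mm_chr)        # "(Intercept)" "xdog"  ("cat" is the baseline level)
all(mm_chr == mm_fct)   # TRUE: both inputs yield the same dummy coding
```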
Factors vs character vectors when manipulating data frames: When manipulating data frames, however, character vectors and factors are treated very differently. Some of the annoyances of factors in R are described on the Quantum Forest blog, in "R pitfall #3: friggin' factors".
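As a minimal sketch of two common surprises (the variable names and values here are illustrative, not taken from the blog post): a factor rejects values that are not already among its levels, and as.numeric() on a factor returns the internal integer codes rather than the labels.

```r
# Illustrative values only.
type_chr <- c("debit", "debit", "credit")
type_fct <- factor(type_chr)

levels(type_fct)      # levels are sorted alphabetically: "credit" "debit"
as.integer(type_fct)  # internal codes: 2 2 1

# A character vector accepts any new string ...
type_chr[1] <- "refund"

# ... but a factor only accepts existing levels; this yields NA plus a warning.
type_fct2 <- type_fct
suppressWarnings(type_fct2[1] <- "refund")
is.na(type_fct2[1])   # TRUE

# Numbers stored as a factor: as.numeric() returns the codes, not the values.
num_fct <- factor(c("10", "20", "10"))
as.numeric(num_fct)                 # 1 2 1
as.numeric(as.character(num_fct))   # 10 20 10
```

Converting through as.character() first is the usual way to recover the original numbers.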
It's useful to set stringsAsFactors = FALSE when reading data from a .csv or .txt file with read.table or read.csv (this is the default since R 4.0.0). As noted in another reply, you have to make sure that every entry in your character vector is spelled consistently, or else every typo will become a separate factor level. You can use the function gsub() to fix typos.
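A small sketch of that cleanup, assuming a hypothetical column raw_type with inconsistent case, a trailing space and one misspelling (none of these values come from the question's data):

```r
# Hypothetical raw column with inconsistent spelling and case.
raw_type <- c("debit", "Debit", "debti", "credit", "credit ")

# Each distinct string becomes its own level.
nlevels(factor(raw_type))   # 5

# Normalise case and whitespace, then fix the known typo with gsub().
clean_type <- trimws(tolower(raw_type))
clean_type <- gsub("^debti$", "debit", clean_type)

levels(factor(clean_type))  # "credit" "debit"
```

Anchoring the pattern with ^ and $ keeps gsub() from rewriting longer strings that merely contain the typo.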
Here is a worked example showing how lm() gives you the same results with a character vector and a factor.
A random independent variable:
continuous_x <- rnorm(10,10,3)
A random categorical variable as a character vector:
character_x <- (rep(c("dog","cat"),5))
Convert the character vector to a factor variable:
factor_x <- as.factor(character_x)
Give the two categories random values:
character_x_value <- ifelse(character_x == "dog", 5*rnorm(1,0,1), rnorm(1,0,2))
Create a random relationship between the independent variables and a dependent variable:
continuous_y <- continuous_x*10*rnorm(1,0) + character_x_value
Compare the output of a linear model with the factor variable and with the character vector. Note the warning that is given with the character vector.
summary(lm(continuous_y ~ continuous_x + factor_x))
summary(lm(continuous_y ~ continuous_x + character_x))
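Rather than comparing the two summaries by eye, the coefficients can be checked programmatically. This is a self-contained sketch, not the original example: the seed, the simplified response formula and the names fit_factor and fit_character are assumptions made here for reproducibility.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
continuous_x <- rnorm(10, 10, 3)
character_x  <- rep(c("dog", "cat"), 5)
factor_x     <- as.factor(character_x)

# Simplified, assumed relationship: a slope of 2 plus an offset of 5 for "dog".
continuous_y <- 2 * continuous_x + ifelse(character_x == "dog", 5, 0) + rnorm(10)

fit_factor    <- lm(continuous_y ~ continuous_x + factor_x)
# suppressWarnings() hides the conversion message if your R version emits it.
fit_character <- suppressWarnings(lm(continuous_y ~ continuous_x + character_x))

# Coefficient names differ only in the variable-name prefix, so compare values.
all.equal(unname(coef(fit_factor)), unname(coef(fit_character)))  # TRUE
```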