R: Working with factors | categorical

Question

I have some data:

transaction <- c(1,2,3);
date <- c("2010-01-31","2010-02-28","2010-03-31");
type <- c("debit", "debit", "credit");
amount <- c(-500, -1000.97, 12500.81);
oldbalance <- c(5000, 4500, 17000.81)
evolution <- data.frame(transaction, date, type, amount, oldbalance, row.names=transaction,  stringsAsFactors=FALSE);
evolution$date <- as.Date(evolution$date, "%Y-%m-%d");
evolution <- transform(evolution, newbalance = oldbalance + amount);
evolution

If I enter the command:

type <- factor(type)

where type is a nominal (categorical) variable, what difference does it make to my data?

Thanks

Recommended answer

Factors vs character vectors when doing stats: For modeling functions such as lm(), it makes no difference whether a categorical variable is stored as a factor or as a character vector, because a character vector is converted to a factor when the model is fitted. In fact, it's often easier to leave categorical variables as character vectors.
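To connect this back to the question, here is a minimal sketch of what factor() changes for the type vector itself. The names type_f and the one-column evolution data frame are illustrative assumptions, not part of the original post. Note that converting the standalone vector does not touch a data frame column that was built from it earlier.

```r
# Same vector as in the question
type <- c("debit", "debit", "credit")

# Assumption: a new name is used so the character vector stays available
type_f <- factor(type)

levels(type_f)      # the distinct categories, sorted alphabetically by default
as.integer(type_f)  # the integer codes that index into the levels
table(type_f)       # counts per level

# A data frame column built from `type` is a separate copy,
# so it has to be converted explicitly
evolution <- data.frame(type, stringsAsFactors = FALSE)
is.factor(evolution$type)                 # still a character column
evolution$type <- factor(evolution$type)
is.factor(evolution$type)                 # now a factor
```

In other words, a factor stores the data as integer codes plus a fixed set of levels, while the values you see when printing remain the same labels.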

If you do a regression or ANOVA with lm() using a character vector as a categorical variable, you'll get the normal model output, but with the message:

Warning message:
In model.matrix.default(mt, mf, contrasts) :
  variable 'character_x' converted to a factor

Factors vs character vectors when manipulating dataframes: When manipulating dataframes, however, character vectors and factors are treated very differently. Some information on the annoyances of R and factors can be found on the Quantum Forest blog, in the post "R pitfall #3: friggin' factors".
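A few of the differences are easy to reproduce. The sketch below uses made-up vectors (f, animals, only_dogs are hypothetical names) to show three common surprises: numeric conversion returns the integer codes, assigning an unknown value produces NA, and unused levels survive subsetting.

```r
# Pitfall 1: numbers stored as a factor
f <- factor(c("10", "20", "10"))
as.numeric(f)                # the integer codes, not the original numbers
as.numeric(as.character(f))  # the intended values

# Pitfall 2: unused levels are kept after subsetting
animals <- factor(c("dog", "cat", "dog"))
only_dogs <- animals[animals == "dog"]
levels(only_dogs)              # "cat" is still listed as a level
levels(droplevels(only_dogs))  # droplevels() removes the unused level

# Pitfall 3: assigning a value that is not an existing level
animals_chr <- as.character(animals)
animals_chr[2] <- "bird"     # fine for a character vector
animals[2] <- "bird"         # warns about an invalid factor level; the element becomes NA
is.na(animals[2])
```

With a character vector none of these steps needs special care, which is why the answer suggests keeping strings as characters while the data is still being cleaned.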

It's useful to set stringsAsFactors = FALSE when reading data from a .csv or .txt file with read.table or read.csv. As noted in another reply, you have to make sure that everything in your character vector is consistent, or else every typo will become a separate factor level. You can use the function gsub() to fix typos.
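As a rough illustration of that workflow, the sketch below reads a small inline CSV (the csv_text contents and the "debti" typo are invented for the example), fixes the typo with gsub() while the column is still character, and only then converts to a factor. Since R 4.0.0 the default is already stringsAsFactors = FALSE, so the argument is spelled out here mainly for older versions and for clarity.

```r
# Hypothetical data with one misspelled category ("debti")
csv_text <- "id,type\n1,debit\n2,debti\n3,credit\n4,debit"

# `text =` lets read.csv parse a string instead of a file
tx <- read.csv(text = csv_text, stringsAsFactors = FALSE)

unique(tx$type)  # the typo shows up as its own value

# Fix the typo while the column is still a character vector
tx$type <- gsub("^debti$", "debit", tx$type)

# Convert to a factor only once the values are consistent
tx$type <- factor(tx$type)
levels(tx$type)
```

Had the column been converted to a factor first, "debti" would have been a third level and would have appeared as an extra category in any table or model.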

Here is a worked example showing how lm() gives you the same results with a character vector and a factor.

A random independent variable:

continuous_x <- rnorm(10,10,3)

A random categorical variable as a character vector:

character_x  <- (rep(c("dog","cat"),5))

Convert the character vector to a factor variable:

factor_x <- as.factor(character_x)

Give the two categories random values:

character_x_value <- ifelse(character_x == "dog", 5*rnorm(1,0,1), rnorm(1,0,2))

Create a random relationship between the independent variables and a dependent variable:

continuous_y <- continuous_x*10*rnorm(1,0) + character_x_value

Compare the output of a linear model with the factor variable and the character vector. Note the warning that is given with the character vector.

summary(lm(continuous_y ~ continuous_x + factor_x))
summary(lm(continuous_y ~ continuous_x + character_x))
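Rather than comparing the two summaries by eye, the fitted coefficients can be checked programmatically. The sketch below repeats the construction above with a fixed seed (set.seed(42), fit_factor and fit_character are additions for illustration, not part of the original answer) and compares the coefficient values. Whether the conversion message is actually printed may depend on your R version, and because this construction adds no residual noise, summary() may also warn about an essentially perfect fit.

```r
set.seed(42)  # assumption: added only so the example is reproducible

continuous_x <- rnorm(10, 10, 3)
character_x  <- rep(c("dog", "cat"), 5)
factor_x     <- as.factor(character_x)

# One random offset per category, as in the example above
character_x_value <- ifelse(character_x == "dog", 5 * rnorm(1, 0, 1), rnorm(1, 0, 2))
continuous_y <- continuous_x * 10 * rnorm(1, 0) + character_x_value

fit_factor    <- lm(continuous_y ~ continuous_x + factor_x)
fit_character <- lm(continuous_y ~ continuous_x + character_x)

# Coefficient names differ (factor_xdog vs character_xdog),
# so strip the names and compare the values only
names(coef(fit_factor))
names(coef(fit_character))
all.equal(unname(coef(fit_factor)), unname(coef(fit_character)))
```

Both fits use "cat" as the reference level because levels are sorted alphabetically, which is why the two models end up with the same design matrix and the same estimates.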