How to convert certain columns of a data frame to factors?

Question

Possible duplicate:
Identifying or coding unique factors using R

I'm having some trouble with R.

I have a data set similar to the following, but much longer.

A B Pulse
1 2 23
2 2 24
2 2 12
2 3 25
1 1 65
1 3 45

Basically, the first 2 columns are coded. A takes the values 1 and 2, which represent 2 different weights. B takes the values 1, 2 and 3, which represent 3 different times.

As they are coded numerical values, R will treat them as numeric variables. I need to use the factor function to convert these variables into factors.

Any help?

Recommended answer

Here's an example:

# Create a data frame
> d <- data.frame(a = 1:3, b = 2:4)
> d
  a b
1 1 2
2 2 3
3 3 4

# Currently there are no levels in the `a` column, since it's numeric, as you point out.
> levels(d$a)
NULL

# Convert that column to a factor
> d$a <- factor(d$a)
> d
  a b
1 1 2
2 2 3
3 3 4

# Now it has levels.
> levels(d$a)
[1] "1" "2" "3"
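
The same idea extends to several columns at once. The following is only a sketch, assuming the question's data has been loaded into a data frame called `dat` (a name chosen here for illustration) with columns `A`, `B` and `Pulse`:

```r
# Rebuild the sample data from the question.
dat <- data.frame(
  A     = c(1, 2, 2, 2, 1, 1),
  B     = c(2, 2, 2, 3, 1, 3),
  Pulse = c(23, 24, 12, 25, 65, 45)
)

# Names of the columns to convert; Pulse is left numeric.
cols <- c("A", "B")

# lapply() applies factor() to each selected column, and assigning
# back into dat[cols] replaces only those columns.
dat[cols] <- lapply(dat[cols], factor)

str(dat)
levels(dat$A)
levels(dat$B)
```

Listing the column names in `cols` keeps the conversion explicit, so measured variables such as `Pulse` are not turned into factors by accident.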

You can also handle this when reading in your data. See the colClasses and stringsAsFactors parameters of e.g. read.csv().
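
As a rough illustration of the colClasses route, the sketch below reads the question's whitespace-separated sample with read.table(). read.csv() passes the same colClasses argument through to it. The inline `txt` string stands in for a real file, which is an assumption made to keep the example self-contained. Note that stringsAsFactors only affects character columns (and has defaulted to FALSE since R 4.0.0), so for numerically coded columns like these colClasses is the relevant parameter.

```r
# Sample data from the question, held in a string instead of a file.
txt <- "A B Pulse
1 2 23
2 2 24
2 2 12
2 3 25
1 1 65
1 3 45"

# A named colClasses vector sets the class of each column by name.
dat <- read.table(
  text = txt,
  header = TRUE,
  colClasses = c(A = "factor", B = "factor", Pulse = "numeric")
)

sapply(dat, class)
```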

Note that, computationally, converting such columns to factors won't help you much, and may actually slow down your program (albeit negligibly). A factor maps every value to an integer ID behind the scenes, so any print of your data.frame requires a lookup on those levels, an extra step which takes time.
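
The ID mapping can be inspected directly. This small sketch (the `pulse` vector is made up for illustration) shows the integer codes a factor stores, and why converting a factor back to numbers should go through as.character() first:

```r
# A factor built from numeric values.
pulse <- factor(c(23, 24, 12))

# The levels are the sorted unique values, stored as strings.
levels(pulse)

# as.integer() returns the internal codes (positions in levels),
# not the original values.
codes <- as.integer(pulse)
codes

# Going through as.character() looks up each level label first,
# which recovers the original numbers.
values <- as.numeric(as.character(pulse))
values
```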

Factors are most useful for storing strings that you don't want to store repeatedly and would rather reference by ID. Consider storing a friendlier name in such columns to benefit fully from factors.
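
One way to attach friendlier names is the labels argument of factor(). In this sketch the labels "light" and "heavy" are hypothetical, since the question does not say what the two weight codes actually stand for:

```r
# Column A from the question: codes 1 and 2 for two weights.
a <- c(1, 2, 2, 2, 1, 1)

# levels lists the codes in order; labels gives the display name
# for each code (the names here are assumed for illustration).
weight <- factor(a, levels = c(1, 2), labels = c("light", "heavy"))

weight
table(weight)
```

Printed output and summaries such as table() then show the names rather than the bare codes, while the factor still stores one small integer per row.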
