Problem description
All four functions seem really similar to me. In some situations some of them might give the same result, and in others they might not. Any help would be greatly appreciated! I now assume that internally, factorize and LabelEncoder work the same way and have no big differences in terms of results. I am not sure whether they take a similar amount of time on large amounts of data.
get_dummies and OneHotEncoder will yield the same result, but OneHotEncoder can only handle numbers, whereas get_dummies will take all kinds of input. get_dummies automatically generates new column names for each input column, but OneHotEncoder does not (it instead assigns new column names 1, 2, 3, ...). So get_dummies seems better in all respects.
Please correct me if I am wrong! Thank you!
These four encoders can be split into two categories:
- Encode labels into categorical variables: Pandas factorize and scikit-learn LabelEncoder. The result has one dimension.
- Encode a categorical variable into dummy/indicator (binary) variables: Pandas get_dummies and scikit-learn OneHotEncoder. The result has n dimensions, one per distinct value of the encoded categorical variable.
The main difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are made to be used in scikit-learn pipelines, with fit and transform methods, as sketched below.
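As an illustration of why the fit/transform split matters, here is a minimal sketch (the train/test frames are made up for this example): a fitted LabelEncoder remembers its mapping and applies it consistently to new data, whereas pd.factorize starts a fresh mapping on every call.
import pandas as pd
from pandas import DataFrame
from sklearn.preprocessing import LabelEncoder

train = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
test = DataFrame(['C', 'A'], columns=['Col'])

le = LabelEncoder()
le.fit(train['Col'])               # learn the label -> integer mapping on the training data
print(le.transform(test['Col']))   # reuse the same mapping on new data: [2 0]

# pd.factorize keeps no such state: calling it on `test` alone starts
# numbering from 0 again, so train and test codes would not be consistent.
print(pd.factorize(test['Col'])[0])  # [0 1], not [2 0]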
Encode labels into categorical variables
Pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables, for example to transform characters into numbers.
import pandas as pd
from pandas import DataFrame
from sklearn import preprocessing

# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

df['Fact'] = pd.factorize(df['Col'])[0]

le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])

print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2
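Both encodings are also easy to reverse when you need the original labels back. A minimal sketch, reusing the same toy data: pd.factorize additionally returns the array of unique values, which maps the codes back, and LabelEncoder provides inverse_transform.
import pandas as pd
from pandas import DataFrame
from sklearn.preprocessing import LabelEncoder

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

codes, uniques = pd.factorize(df['Col'])
print(list(uniques[codes]))        # ['A', 'B', 'B', 'C'] -- codes mapped back to the labels

le = LabelEncoder()
lab = le.fit_transform(df['Col'])
print(le.inverse_transform(lab))   # ['A' 'B' 'B' 'C']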
Encode a categorical variable into dummy/indicator (binary) variables
Pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They can be used to create binary variables. OneHotEncoder can only be used with categorical integers, while get_dummies can also be used with other types of variables (see the mixed-type sketch after the example below).
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df = pd.get_dummies(df)

print(df)
#   Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0
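To illustrate the point about mixed input types, a small sketch (the 'Num' column is made up for this example): applied to a whole DataFrame, get_dummies expands only the object/categorical columns and leaves numeric columns untouched.
import pandas as pd
from pandas import DataFrame

# Hypothetical mixed-type frame: only the string column 'Col' gets dummified.
df = DataFrame({'Col': ['A', 'B', 'B', 'C'], 'Num': [1, 2, 3, 4]})
print(pd.get_dummies(df))
#    Num  Col_A  Col_B  Col_C
# 0    1      1      0      0
# 1    2      0      1      0
# 2    3      0      1      0
# 3    4      0      0      1
# (depending on the pandas version, the dummy values may print as True/False)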
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

# We first need to transform the characters into integers in order to use OneHotEncoder
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])

enc = OneHotEncoder()
df = DataFrame(enc.fit_transform(df).toarray())

print(df)
#      0    1    2
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  1.0  0.0
# 3  0.0  0.0  1.0
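Note that the LabelEncoder detour is only needed for old scikit-learn releases: since scikit-learn 0.20, OneHotEncoder accepts string categories directly and can report the generated column names. A minimal sketch (get_feature_names_out requires scikit-learn >= 1.0):
from pandas import DataFrame
from sklearn.preprocessing import OneHotEncoder

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

enc = OneHotEncoder()
encoded = enc.fit_transform(df[['Col']]).toarray()   # string categories are accepted directly
print(DataFrame(encoded, columns=enc.get_feature_names_out(['Col'])))
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0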