This might be a beginner question, but I have seen a lot of people using `LabelEncoder()` to replace categorical variables with ordinality. A lot of people use this feature by passing multiple columns at a time; however, I have some doubts about having the wrong ordinality in some of my features and how that will affect my model. Here is an example:

Input:

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

a = pd.DataFrame(['High', 'Low', 'Low', 'Medium'])
le = LabelEncoder()
le.fit_transform(a)
```

Output:

```
array([0, 1, 1, 2], dtype=int64)
```

As you can see, the ordinal values are not mapped correctly, since the LabelEncoder only cares about the alphabetical order of the values in the column/array (it should be High=1, Med=2, Low=3 or vice versa). How drastically can a wrong mapping affect the model, and is there an easy way other than `OrdinalEncoder()` to map these values properly?

**Solution**

Short answer: using a `LabelEncoder` to encode any kind of feature is a bad idea!

This is in fact clearly stated in the docs, where it is mentioned that, as its name suggests, this encoding method is aimed at encoding the label:

> This transformer should be used to encode target values, i.e. `y`, and not the input `X`.

As you rightly point out in the question, mapping the inherent ordinality of a feature to the wrong scale will have a very negative impact on the performance of the model (that is, proportional to the relevance of the feature). An intuitive way to think about it is the way a decision tree sets its boundaries. During training, a decision tree learns the optimal feature to split on at each node, as well as an optimal threshold, whereby unseen samples follow one branch or the other depending on these values.

If we encode an ordinal feature using a simple LabelEncoder, we could end up with a feature where, say, 1 represents *warm*, 2 translates to *hot*, and 0 translates to *boiling*. In such a case, instead of a higher value giving a higher degree of certainty about which path a sample should take, the optimal threshold just ends up being meaningless, and the feature loses its predictive power.

Instead, for a simple way of encoding these features, you could go with `pd.Categorical` and obtain the categorical's codes if you want a numerical representation that respects the specified order. Or, if you need `fit`/`transform` methods to replicate the transformation on unseen data (the more common scenario), you should use an `OrdinalEncoder`.
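For instance, here is a minimal sketch on the question's own example (the column name `temp` is just for illustration), showing both approaches with the real order stated explicitly:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

a = pd.DataFrame(['High', 'Low', 'Low', 'Medium'], columns=['temp'])

# Option 1: pd.Categorical with an explicit category order; the codes
# then follow that order instead of alphabetical order
cat = pd.Categorical(a['temp'], categories=['Low', 'Medium', 'High'], ordered=True)
print(cat.codes)  # [2 0 0 1] -> Low=0, Medium=1, High=2

# Option 2: OrdinalEncoder with explicit categories; keeps fit/transform
# so the same mapping can be reapplied to unseen data
enc = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
print(enc.fit_transform(a))  # [[2.], [0.], [0.], [1.]]
```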
Though actually *seeing* why this is a bad idea will be more intuitive than just words.

Let's use a simple example to illustrate the above, consisting of two ordinal features, a range with the number of hours spent by a student preparing for an exam and the average grade of all previous assignments, and a target variable indicating whether the exam was passed or not. I've defined the dataframe's columns as `pd.Categorical`:

```python
df = pd.DataFrame(
    {'Hours_of_dedication': pd.Categorical(
         values=['20-25', '20-25', '5-10', '5-10', '40-45', '0-5', '15-20',
                 '20-25', '30-35', '5-10', '10-15', '45-50', '20-25'],
         categories=['0-5', '5-10', '10-15', '15-20', '20-25', '25-30',
                     '30-35', '40-45', '45-50']),
     'Assignments_avg_grade': pd.Categorical(
         values=['B', 'C', 'F', 'C', 'B', 'D', 'C', 'A', 'B', 'B', 'B', 'A', 'D'],
         categories=['F', 'D', 'C', 'B', 'A']),
     'Result': pd.Categorical(
         values=['Pass', 'Pass', 'Fail', 'Fail', 'Pass', 'Fail', 'Fail',
                 'Pass', 'Pass', 'Fail', 'Fail', 'Pass', 'Pass'],
         categories=['Fail', 'Pass'])
    })
```

The advantage of defining a categorical column as a pandas categorical is that we get to establish an order among its categories, as mentioned earlier. This allows for much faster sorting based on the established order rather than lexical sorting, and it also gives us a simple way to get codes for the different categories according to their order.

So the dataframe we'll be using looks as follows:

```python
print(df.head(10))

  Hours_of_dedication Assignments_avg_grade Result
0               20-25                     B   Pass
1               20-25                     C   Pass
2                5-10                     F   Fail
3                5-10                     C   Fail
4               40-45                     B   Pass
5                 0-5                     D   Fail
6               15-20                     C   Fail
7               20-25                     A   Pass
8               30-35                     B   Pass
9                5-10                     B   Fail
```

The corresponding category codes can be obtained with:

```python
X = df.apply(lambda x: x.cat.codes)
X.head(10)

   Hours_of_dedication  Assignments_avg_grade  Result
0                    4                      3       1
1                    4                      2       1
2                    1                      0       0
3                    1                      2       0
4                    7                      3       1
5                    0                      1       0
6                    3                      2       0
7                    4                      4       1
8                    6                      3       1
9                    1                      3       0
```

Now let's fit a `DecisionTreeClassifier` and see how the tree has defined the splits:

```python
from sklearn import tree

dt = tree.DecisionTreeClassifier()
y = X.pop('Result')
dt.fit(X, y)
```

We can visualise the tree structure using `plot_tree`:

```python
t = tree.plot_tree(dt,
                   feature_names=X.columns,
                   class_names=['Fail', 'Pass'],
                   filled=True,
                   label='all',
                   rounded=True)
```

Is that all?? Well… yes! I've actually set the features up in such a way that there is a simple and obvious relation between the `Hours_of_dedication` feature and whether the exam is passed, making it clear that the problem should be very easy to model.

Now let's try to do the same by directly encoding all features with an encoding scheme we could have obtained, for instance, through a LabelEncoder, i.e. disregarding the actual ordinality of the features and just assigning values at random:

```python
from matplotlib import rcParams

df_wrong = df.copy()
# Scramble the category order. Note: Series.cat.set_categories no longer
# supports inplace=True in recent pandas, so we reassign the columns instead.
df_wrong['Hours_of_dedication'] = df_wrong['Hours_of_dedication'].cat.set_categories(
    ['0-5', '40-45', '25-30', '10-15', '5-10', '45-50', '15-20', '20-25', '30-35'])
df_wrong['Assignments_avg_grade'] = df_wrong['Assignments_avg_grade'].cat.set_categories(
    ['A', 'C', 'F', 'D', 'B'])

rcParams['figure.figsize'] = 14, 18
X_wrong = df_wrong.drop(columns=['Result']).apply(lambda x: x.cat.codes)
y = df_wrong.Result

dt_wrong = tree.DecisionTreeClassifier()
dt_wrong.fit(X_wrong, y)

t = tree.plot_tree(dt_wrong,
                   feature_names=X_wrong.columns,
                   class_names=['Fail', 'Pass'],
                   filled=True,
                   label='all',
                   rounded=True)
```

As expected, the tree structure is way more complex than necessary for the simple problem we're trying to model.
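One quick way to quantify this without reading the plots is to compare the two fitted trees directly (a small addition to the original example; `get_depth` and `get_n_leaves` are standard `DecisionTreeClassifier` methods):

```python
# Complexity of the tree trained on the correctly ordered encoding
print(dt.get_depth(), dt.get_n_leaves())
# Complexity of the tree trained on the scrambled encoding
print(dt_wrong.get_depth(), dt_wrong.get_n_leaves())
```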
In order for the tree to correctly predict all training samples, it has expanded to a depth of 4, when a single split should suffice. This implies that the classifier is likely to overfit, since we're drastically increasing its complexity. And pruning the tree and tuning its parameters to prevent overfitting won't solve the problem either, since we've added too much noise by wrongly encoding the features.

So to summarise: preserving the ordinality of the features when encoding them is crucial, otherwise, as this example makes clear, we'll lose all their predictive power and just add noise to our model.
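As a final practical note, here is a minimal sketch (an addition to the original answer, reusing the `df` defined above) of how the order-aware encoding can be wired into a pipeline, so unseen data gets transformed with the exact same mapping:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Explicit category orders, matching the pd.Categorical definitions above
hours_order = ['0-5', '5-10', '10-15', '15-20', '20-25',
               '25-30', '30-35', '40-45', '45-50']
grades_order = ['F', 'D', 'C', 'B', 'A']

pipe = make_pipeline(
    OrdinalEncoder(categories=[hours_order, grades_order]),
    DecisionTreeClassifier(),
)
pipe.fit(df[['Hours_of_dedication', 'Assignments_avg_grade']], df['Result'])
```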