How to do one-hot encoding in Python


Question

I have a machine learning classification problem in which 80% of the variables are categorical. Must I use one-hot encoding if I want to use a classifier? Can I pass the data to a classifier without encoding it?

I am trying to do the following for feature selection:

  1. I read the training file:

import pandas as pd

num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)

  2. I change the type of the categorical features to 'category':

    non_categorial_features = ['orig_destination_distance',
                              'srch_adults_cnt',
                              'srch_children_cnt',
                              'srch_rm_cnt',
                              'cnt']
    
    for categorical_feature in list(train_small.columns):
        if categorical_feature not in non_categorial_features:
            train_small[categorical_feature] = train_small[categorical_feature].astype('category')
    

  3. I use one-hot encoding:

    train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
    

    The problem is that the third step often gets stuck, even though I am using a powerful machine.

    Thus, without the one-hot encoding I cannot do any feature selection to determine the importance of the features.

    What do you recommend?

    Recommended answer

    Approach 1: You can use get_dummies on a pandas DataFrame.

    Example 1:

    import pandas as pd
    s = pd.Series(list('abca'))
    pd.get_dummies(s)
    Out[]:
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0
    3  1.0  0.0  0.0
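
The output above shows floating-point indicators. Recent pandas releases (2.0 and later) return boolean columns from get_dummies by default, so the printed values may differ on a current install. A minimal sketch, assuming a pandas version that supports the dtype argument, that requests floats explicitly so the result matches the table above (the name `dummies` is only illustrative):

```python
import pandas as pd

# Same series as in Example 1
s = pd.Series(list('abca'))

# Ask for float indicators explicitly instead of relying on the default dtype
dummies = pd.get_dummies(s, dtype=float)
print(dummies)
```

Passing dtype=int instead would give 0/1 integers, which is what Example 2 displays.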
    

    Example 2:

    The following transforms a given column into one-hot columns. Use the prefix argument to tell the dummy columns of different source columns apart.

    import pandas as pd
    
    df = pd.DataFrame({
              'A':['a','b','a'],
              'B':['b','a','c']
            })
    df
    Out[]:
       A  B
    0  a  b
    1  b  a
    2  a  c
    
    # Get the one-hot encoding of column B
    one_hot = pd.get_dummies(df['B'])
    # Drop column B as it is now encoded
    df = df.drop('B',axis = 1)
    # Join the encoded df
    df = df.join(one_hot)
    df
    Out[]:
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1
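
Example 2 mentions prefix but does not show it. The following is a minimal sketch, not part of the original answer, that encodes both columns in one call using the columns and prefix arguments of get_dummies. The variable name `encoded` is illustrative, and dtype=int is passed on the assumption that 0/1 integers are wanted, as in the output above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'a', 'c'],
})

# Encode both columns at once; each dummy column is named <prefix>_<value>,
# so 'a' from column A and 'a' from column B do not collide
encoded = pd.get_dummies(df, columns=['A', 'B'], prefix=['A', 'B'], dtype=int)
print(encoded.columns.tolist())
```

Without a prefix, joining the dummies of several columns by hand (as in the drop and join steps above) can produce duplicate column names when two source columns share a value.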
    

    Approach 2: Use scikit-learn

    Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.

    >>> from sklearn.preprocessing import OneHotEncoder
    >>> enc = OneHotEncoder()
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
    >>> enc.n_values_
    array([2, 3, 4])
    >>> enc.feature_indices_
    array([0, 2, 5, 9], dtype=int32)
    >>> enc.transform([[0, 1, 1]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
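
The n_values_ and feature_indices_ attributes in the session above come from older scikit-learn releases. To my knowledge they were deprecated in version 0.20 and removed afterwards, so the session may not run as written on a current install. A minimal sketch of the same example, assuming a recent version in which the fitted encoder exposes the values seen per feature as categories_ (the names `categories` and `encoded` are illustrative):

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on the same four samples with three integer-coded features
enc = OneHotEncoder(handle_unknown='error')
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# Distinct values learned for each feature, as plain Python lists
categories = [c.tolist() for c in enc.categories_]
print(categories)

# transform returns a sparse matrix by default; convert it for display
encoded = enc.transform([[0, 1, 1]]).toarray()
print(encoded)
```

In this form the encoder learns the distinct values of each feature rather than a maximum value, but for this data the encoded row is the same as in the session above.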
    

    Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html


    07-23 07:08