LabelEncoder with ordered encoding


Question

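I am using a label encoder:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = [[1, 'A'],
        [1, 'A'],
        [1, 'B'],
        [2, 'C']]

le = LabelEncoder()
df = pd.DataFrame(data=data, columns=['id', 'element'])
# Replace the string labels with integer codes
df['element'] = le.fit_transform(df['element'])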

Output:

   id  element
0   1        0
1   1        0
2   1        1
3   2        2

This is fine, but when I have a lot of data the sequence gets mixed up, something like this:

   id  element
0   1       1
1   2       1
2   3       2
3   4       0

Is there any solution, without the label encoder, that makes sure the sequence is maintained?

Recommended answer

TL;DR: For a simple approach, there's pd.factorize. If you need the usual scikit-learn fit/transform interface, an OrderedLabelEncoder is defined below. It simply overrides two of the base class's methods so that the codes follow the order in which the classes first appear.

The classes in object dtype columns get sorted lexicographically in LabelEncoder, which causes the resulting codes to appear unordered. This can be seen in _encode_python, which is called in its fit method. (It is a private helper, so its name and location may differ between scikit-learn versions.) There, when the column dtype is object, the classes variable (later used to map the values) is built by taking a set of the values and sorting it. A small example that replicates what _encode_python does:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 'C'], [1, 'C'], [1, 'B'], [2, 'A']], columns=['id', 'element'])
values = df.element.to_numpy()
# array(['C', 'C', 'B', 'A'], dtype=object)
uniques = sorted(set(values))  # sorting discards the order of appearance
uniques = np.array(uniques, dtype=values.dtype)
table = {val: i for i, val in enumerate(uniques)}
print(table)
# {'A': 0, 'B': 1, 'C': 2}

The sorted unique values are then used to build a lookup table, which determines the code assigned to each class.

Hence, in this case we'd get:

ole = LabelEncoder()
ole.fit_transform(df.element)
# array([2, 2, 1, 0])
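To make the difference concrete, here is a minimal pure-Python sketch that encodes the same values with two lookup tables: one built from sorted classes, as described above, and one built from classes in order of first appearance. The variable names are illustrative only and are not part of scikit-learn.

```python
# Same input as the example above, as a plain list.
values = ["C", "C", "B", "A"]

# Lookup table from lexicographically sorted classes.
sorted_classes = sorted(set(values))
sorted_table = {val: i for i, val in enumerate(sorted_classes)}
sorted_codes = [sorted_table[v] for v in values]

# Lookup table from classes in order of first appearance
# (dict keys preserve insertion order).
seen_classes = list(dict.fromkeys(values))
seen_table = {val: i for i, val in enumerate(seen_classes)}
seen_codes = [seen_table[v] for v in values]

print(sorted_codes)  # [2, 2, 1, 0]
print(seen_codes)    # [0, 0, 1, 2]
```

Only the way the list of classes is built differs; the mapping step is identical in both cases.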

For a simple alternative, you have pd.factorize, which assigns codes in order of first appearance:

df['element'] = pd.factorize(df.element)[0]
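As a short sketch of what pd.factorize returns, the example below unpacks both parts of the result (the integer codes and the unique labels) and uses the labels to map the codes back. It rebuilds the small frame from this answer so it can be run on its own; the variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, "C"], [1, "C"], [1, "B"], [2, "A"]],
                  columns=["id", "element"])

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(df["element"])
print(codes.tolist())   # [0, 0, 1, 2]
print(list(uniques))    # ['C', 'B', 'A']

# uniques maps the codes back to the original labels.
restored = uniques.take(codes)
print(list(restored))   # ['C', 'C', 'B', 'A']

# With sort=True the uniques are sorted first, giving the lexicographic numbering.
sorted_codes, sorted_uniques = pd.factorize(df["element"], sort=True)
print(sorted_codes.tolist())  # [2, 2, 1, 0]
```

Keeping uniques around matters if the codes need to be decoded later, since the assignment in the one-liner above discards it.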

If you need a class with the usual scikit-learn fit/transform methods, though, we can redefine the specific function that builds the classes and come up with an equivalent that maintains the order of appearance. A simple approach is to set the column values as dictionary keys (which preserve insertion order in Python 3.7 and later) with uniques = list(dict.fromkeys(values)):

import numpy as np

def ordered_encode_python(values, uniques=None, encode=False):
    # Mirrors scikit-learn's _encode_python, but keeps order of appearance
    if uniques is None:
        # dict keys preserve insertion order, so no sorting takes place
        uniques = list(dict.fromkeys(values))
        uniques = np.array(uniques, dtype=values.dtype)
    if encode:
        table = {val: i for i, val in enumerate(uniques)}
        try:
            encoded = np.array([table[v] for v in values])
        except KeyError as e:
            raise ValueError("y contains previously unseen labels: %s"
                             % str(e))
        return uniques, encoded
    else:
        return uniques

Then we could inherit from LabelEncoder and define OrderedLabelEncoder as:

from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import column_or_1d

class OrderedLabelEncoder(LabelEncoder):
    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = ordered_encode_python(y)
        # scikit-learn convention: fit returns the estimator itself
        return self

    def fit_transform(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_, y = ordered_encode_python(y, encode=True)
        return y

One could then proceed just as with LabelEncoder, for instance:

ole = OrderedLabelEncoder()
ole.fit(df.element)
ole.classes_
# array(['C', 'B', 'A'], dtype=object)
ole.transform(df.element)
# array([0, 0, 1, 2])
ole.inverse_transform(np.array([0, 0, 1, 2]))
# array(['C', 'C', 'B', 'A'], dtype=object)

Or we could call fit_transform directly:

ole.fit_transform(df.element)
# array([0, 0, 1, 2])
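Since the question asks for a solution without the label encoder, the same idea can also be sketched with no scikit-learn dependency at all. The class below is hypothetical (SimpleOrderedEncoder is not a library class) and works on plain Python sequences; it only illustrates the mechanism of a lookup table built in order of first appearance.

```python
class SimpleOrderedEncoder:
    """Hypothetical, scikit-learn-free encoder; codes follow first appearance."""

    def fit(self, values):
        # dict.fromkeys keeps the first occurrence of each label, in order.
        self.classes_ = list(dict.fromkeys(values))
        self._table = {val: i for i, val in enumerate(self.classes_)}
        return self

    def transform(self, values):
        try:
            return [self._table[v] for v in values]
        except KeyError as e:
            raise ValueError(f"values contain previously unseen labels: {e}") from e

    def inverse_transform(self, codes):
        return [self.classes_[c] for c in codes]


enc = SimpleOrderedEncoder().fit(["C", "C", "B", "A"])
print(enc.classes_)                      # ['C', 'B', 'A']
print(enc.transform(["A", "C"]))         # [2, 0]
print(enc.inverse_transform([0, 1, 2]))  # ['C', 'B', 'A']
```

Labels that were not seen during fit raise a ValueError in transform, matching the behaviour of the ordered_encode_python function in this answer.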
