Problem description
I am using a label encoder:
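import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = [[1, 'A'],
        [1, 'A'],
        [1, 'B'],
        [2, 'C']]
le = LabelEncoder()
df = pd.DataFrame(data=data, columns=['id', 'element'])
# Replace the string labels with integer codes
df['element'] = le.fit_transform(df['element'])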
Output:
id element
0 1 0
1 2 0
2 3 1
3 4 2
This is fine, but if I have a lot of data, the sequence gets mixed up, something like this:
id element
0 1 1
1 2 1
2 3 2
3 4 0
Is there any solution, without LabelEncoder, that makes sure the sequence is maintained?
Recommended answer
TL;DR: For a simple approach there's pd.factorize. For an approach with the usual scikit-learn fit/transform methods, an OrderedLabelEncoder is defined further down, which simply overrides two of the base class's methods to obtain an encoding where codes follow the order of appearance of the classes.
The classes in object dtype columns get sorted lexicographically in LabelEncoder, which causes the resulting codes to appear unordered. This can be seen in _encode_python, which is called in its fit method. There, when the column dtype is object, the classes variable (later used to map the values) is defined by taking the set of the values and sorting it. A clear example (replicating what is done in _encode_python):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 'C'], [1, 'C'], [1, 'B'], [2, 'A']], columns=['id', 'element'])
values = df.element.to_numpy()
# array(['C', 'C', 'B', 'A'], dtype=object)
# Sorting the set discards the order of appearance
uniques = sorted(set(values))
uniques = np.array(uniques, dtype=values.dtype)
table = {val: i for i, val in enumerate(uniques)}
print(table)
# {'A': 0, 'B': 1, 'C': 2}
The resulting sorted unique values are used to define a lookup table, which determines the code assigned to each class.
Hence, in this case we'd get:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform(df.element)
# array([2, 2, 1, 0])
For a simple alternative, there is pd.factorize, which maintains the order of appearance:
df['element'] = pd.factorize(df.element)[0]
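As a minimal sketch of what pd.factorize returns, the snippet below keeps both outputs (the codes and the unique values) rather than only the codes. Reusing the uniques to encode new data via Index.get_indexer is an illustrative addition, not part of the original answer, and the new labels ['A', 'C', 'D'] are made up for the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 'C'], [1, 'C'], [1, 'B'], [2, 'A']],
                  columns=['id', 'element'])

# factorize returns the integer codes and the unique values,
# both following the order of first appearance
codes, uniques = pd.factorize(df['element'])
print(codes.tolist())   # [0, 0, 1, 2]
print(list(uniques))    # ['C', 'B', 'A']

# Illustrative assumption: reuse the same mapping on new data.
# Labels not seen before ('D' here) are mapped to -1.
new_codes = uniques.get_indexer(['A', 'C', 'D'])
print(new_codes.tolist())  # [2, 0, -1]

# Decode the codes back into the original labels
decoded = uniques.take(codes)
print(list(decoded))    # ['C', 'C', 'B', 'A']
```

Keeping uniques around plays roughly the role of classes_ on a fitted encoder, which may be enough when a full scikit-learn estimator is not needed.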
Though if you need a class with the usual scikit-learn fit/transform methods, we could redefine the specific function that defines the classes, and come up with an equivalent that maintains the order of appearance. A simple approach is to set the column values as dictionary keys (which maintain insertion order in Python 3.7 and later) with uniques = list(dict.fromkeys(values)):
import numpy as np

def ordered_encode_python(values, uniques=None, encode=False):
    # Same logic as _encode_python, except that the classes keep
    # their order of first appearance instead of being sorted
    if uniques is None:
        uniques = list(dict.fromkeys(values))
        uniques = np.array(uniques, dtype=values.dtype)
    if encode:
        # Lookup table from class to integer code
        table = {val: i for i, val in enumerate(uniques)}
        try:
            encoded = np.array([table[v] for v in values])
        except KeyError as e:
            raise ValueError("y contains previously unseen labels: %s"
                             % str(e))
        return uniques, encoded
    else:
        return uniques
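To see in isolation why swapping sorted(set(values)) for list(dict.fromkeys(values)) changes the codes, here is a small pure-Python sketch. The variable names are illustrative only and the example uses the same four labels as above.

```python
values = ['C', 'C', 'B', 'A']

# Sorted set: order of appearance is lost, classes are ordered alphabetically
sorted_uniques = sorted(set(values))
# dict.fromkeys: duplicates are dropped, first-appearance order is kept
ordered_uniques = list(dict.fromkeys(values))

sorted_table = {val: i for i, val in enumerate(sorted_uniques)}
ordered_table = {val: i for i, val in enumerate(ordered_uniques)}

sorted_codes = [sorted_table[v] for v in values]
ordered_codes = [ordered_table[v] for v in values]

print(sorted_uniques, sorted_codes)    # ['A', 'B', 'C'] [2, 2, 1, 0]
print(ordered_uniques, ordered_codes)  # ['C', 'B', 'A'] [0, 0, 1, 2]
```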
Then we could inherit from LabelEncoder and define OrderedLabelEncoder as:
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import column_or_1d

class OrderedLabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        # Classes are stored in order of first appearance
        self.classes_ = ordered_encode_python(y)
        # Return self, as scikit-learn estimators do
        return self

    def fit_transform(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_, y = ordered_encode_python(y, encode=True)
        return y
One could then proceed just as with LabelEncoder, for instance:
ole = OrderedLabelEncoder()
ole.fit(df.element)
ole.classes_
# array(['C', 'B', 'A'], dtype=object)
ole.transform(df.element)
# array([0, 0, 1, 2])
ole.inverse_transform(np.array([0, 0, 1, 2]))
# array(['C', 'C', 'B', 'A'], dtype=object)
Or we could also call fit_transform:
ole.fit_transform(df.element)
# array([0, 0, 1, 2])