Why does Spark's OneHotEncoder drop the last category by default?
Problem description
I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default.
For example:
>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer(inputCol="c", outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+
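As a side note on the indices above: StringIndexer assigns indices by descending label frequency, so the most frequent label "a" gets index 0. The fitted model exposes the ordering (a minimal sketch, assuming a PySpark version where StringIndexerModel exposes the labels property; the output is consistent with the indices shown above):

>>> ss.fit(fd).labels  # index i corresponds to labels[i]
['a', 'b', 'c']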
By default, the OneHotEncoder will drop the last category:
>>> oe = OneHotEncoder(inputCol="c_idx", outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(2,[0],[1.0])|
| 1.5|  a|  0.0|(2,[0],[1.0])|
|10.0|  b|  1.0|(2,[1],[1.0])|
| 3.2|  c|  2.0|    (2,[],[])|
+----+---+-----+-------------+
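Reading the sparse output: (2,[0],[1.0]) denotes a length-2 vector with a 1.0 at position 0, and (2,[],[]) is the all-zero vector that implicitly encodes the dropped last category "c". One way to make this visible is to densify the column (a minimal sketch using a throwaway UDF named to_dense; Vector.toArray is standard):

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import ArrayType, DoubleType
>>> to_dense = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
>>> fe.select("c", to_dense("c_idx_vec").alias("dense")).show()
+---+----------+
|  c|     dense|
+---+----------+
|  a|[1.0, 0.0]|
|  a|[1.0, 0.0]|
|  b|[0.0, 1.0]|
|  c|[0.0, 0.0]|
+---+----------+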
Of course, this behavior can be changed:
>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(3,[0],[1.0])|
| 1.5|  a|  0.0|(3,[0],[1.0])|
|10.0|  b|  1.0|(3,[1],[1.0])|
| 3.2|  c|  2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
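With dropLast=False every row now has exactly one nonzero entry, so the entries of each encoded vector sum to 1.0; this is the linear dependence that motivates the default. A quick check (minimal sketch, going through the underlying RDD):

>>> fl.select("c_idx_vec").rdd.map(lambda r: float(r[0].toArray().sum())).collect()
[1.0, 1.0, 1.0, 1.0]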
Questions:

- In what case is the default behavior desirable?
- What issues might be overlooked by blindly calling setDropLast(False)?
- What do the authors mean by the following statement in the documentation?

"The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent."
Recommended answer
According to the documentation, it is to keep the columns independent: if every category gets its own indicator column, the entries of each encoded vector sum to one, so the columns are linearly dependent with an intercept (or with each other across the full set of categories).

https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/ml/feature/OneHotEncoder.html
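To see the dependence concretely: with dropLast=False the three indicator columns sum to the all-ones column, so a design matrix that also contains an intercept is rank-deficient, whereas dropping the last category restores full column rank. A minimal sketch with NumPy (the matrices are transcribed by hand from the tables above):

>>> import numpy as np
>>> # intercept + full one-hot encoding of a, b, c (dropLast=False)
>>> X = np.array([[1., 1., 0., 0.],
...               [1., 1., 0., 0.],
...               [1., 0., 1., 0.],
...               [1., 0., 0., 1.]])
>>> np.linalg.matrix_rank(X)  # 3 < 4 columns: rank-deficient
3
>>> # intercept + encoding with the last category dropped (dropLast=True)
>>> Xd = np.array([[1., 1., 0.],
...                [1., 1., 0.],
...                [1., 0., 1.],
...                [1., 0., 0.]])
>>> np.linalg.matrix_rank(Xd)  # 3 == 3 columns: full column rank
3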