Why does Spark's OneHotEncoder drop the last category by default?

Problem description

I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default.

For example:

>>> from pyspark.ml.feature import StringIndexer, OneHotEncoder
>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+

By default, the OneHotEncoder drops the last category:

>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(2,[0],[1.0])|
| 1.5|  a|  0.0|(2,[0],[1.0])|
|10.0|  b|  1.0|(2,[1],[1.0])|
| 3.2|  c|  2.0|    (2,[],[])|
+----+---+-----+-------------+
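
The `c_idx_vec` column above is printed in sparse-vector notation: `(size, [indices], [values])`. As a minimal sketch of how to read it, the plain-Python helper below expands that notation into a dense list. The name `to_dense` is hypothetical and is not part of Spark's API; it only illustrates that the dropped category "c" is encoded as the all-zero vector.

```python
def to_dense(size, indices, values):
    """Expand (size, indices, values) sparse notation into a dense list.

    Hypothetical helper for illustration only; not a Spark function.
    """
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The three distinct encodings from the table above (dropLast left at its default).
vec_a = to_dense(2, [0], [1.0])  # category "a"
vec_b = to_dense(2, [1], [1.0])  # category "b"
vec_c = to_dense(2, [], [])      # category "c": the dropped last category

print(vec_a, vec_b, vec_c)
```

With the default setting, three categories therefore need only two vector slots, and the last category is identified by the absence of any 1.0 entry.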

Of course, this behavior can be changed:

>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(3,[0],[1.0])|
| 1.5|  a|  0.0|(3,[0],[1.0])|
|10.0|  b|  1.0|(3,[1],[1.0])|
| 3.2|  c|  2.0|(3,[2],[1.0])|
+----+---+-----+-------------+

Questions:

  • In what case is the default behavior desirable?
  • What issues might be overlooked by blindly calling setDropLast(False)?
  • What do the authors mean by the following statement in the documentation?

Recommended answer

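According to the documentation, the last category is dropped to keep the columns independent: if every category were included, the vector entries would always sum to one and would therefore be linearly dependent.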

https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/ml/feature/OneHotEncoder.html
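
The independence argument can be checked by hand with a small plain-Python sketch. It needs no Spark and does not reproduce Spark's implementation; `one_hot` is a hypothetical helper written only for this illustration, applied to the `c_idx` values from the question.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense 0/1 list.

    Hypothetical helper for illustration; it mimics the effect of the
    dropLast setting but is not Spark code.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

indices = [0, 0, 1, 2]  # the c_idx column from the question

full = [one_hot(i, 3, drop_last=False) for i in indices]
dropped = [one_hot(i, 3, drop_last=True) for i in indices]

# With every category kept, each row sums to 1, so the encoded columns
# always add up to a constant column: they are linearly dependent
# (and collinear with an intercept term, if the model has one).
full_row_sums = [sum(row) for row in full]

# With the last category dropped, the rows no longer sum to a constant,
# and no information is lost: the dropped category is 1 - sum(row).
dropped_row_sums = [sum(row) for row in dropped]
recovered_last = [1.0 - s for s in dropped_row_sums]

print(full_row_sums)
print(dropped_row_sums)
print(recovered_last)
```

This suggests when the default is desirable: models that fit an intercept without regularization, such as ordinary linear regression, can become ill-conditioned when the features are linearly dependent. Keeping all categories with setDropLast(False) is generally less of a concern for models that do not rely on independent columns, such as tree-based models.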

08-22 16:44