Problem Description
Is it possible to factorize a Spark dataframe column? By factorizing I mean creating a mapping of each unique value in the column to the same ID.
For example, going from the original dataframe:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| A|
|1473492972|4060600988513370| A|
|1473509764|4060600988513370| B|
|1473513432|4060600988513370| C|
|1473513432|4060600988513370| A|
+----------+----------------+--------------------+
to the factorized version:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| 0|
|1473492972|4060600988513370| 0|
|1473509764|4060600988513370| 1|
|1473513432|4060600988513370| 2|
|1473513432|4060600988513370| 0|
+----------+----------------+--------------------+
In Scala itself this would be fairly simple, but since Spark distributes its dataframes over nodes, I'm not sure how to keep a consistent mapping from A->0, B->1, C->2.
Also, assume the dataframe is pretty big (gigabytes), which means loading an entire column into the memory of a single machine might not be possible.
Can it be done?
Recommended Answer
You can use StringIndexer to encode the letters into indices:
import org.apache.spark.ml.feature.StringIndexer

// Fit an indexer on col3 and write the encoded values to col3Index
val indexer = new StringIndexer()
  .setInputCol("col3")
  .setOutputCol("col3Index")

val indexed = indexer.fit(df).transform(df)
indexed.show()
+----------+----------------+----+---------+
| col1| col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370| A| 0.0|
|1473492972|4060600988513370| A| 0.0|
|1473509764|4060600988513370| B| 1.0|
|1473513432|4060600988513370| C| 2.0|
|1473513432|4060600988513370| A| 0.0|
+----------+----------------+----+---------+
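Note that StringIndexer outputs a Double column and, by default, orders labels by frequency (the most frequent label gets 0.0), so the result is not the plain integer column shown in the question. The fitted StringIndexerModel also exposes the mapping itself via labels (labelsArray in Spark 3.x) if you need A->0, B->1, C->2 explicitly. Below is a minimal sketch of casting the indices back to integers under the original column name; it assumes the indexed dataframe from above, and the factorized name is mine:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Cast the Double indices to integers and put them back under the
// original column name, matching the layout in the question.
val factorized = indexed
  .drop("col3")
  .withColumn("col3Index", col("col3Index").cast(IntegerType))
  .withColumnRenamed("col3Index", "col3")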
The data:
val df = spark.createDataFrame(Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")