Problem Description
Is it possible to factorize a Spark dataframe column? By factorizing I mean creating a mapping of each unique value in the column to the same ID.
For example, going from the original dataframe:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| A|
|1473492972|4060600988513370| A|
|1473509764|4060600988513370| B|
|1473513432|4060600988513370| C|
|1473513432|4060600988513370| A|
+----------+----------------+--------------------+
to the factorized version:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| 0|
|1473492972|4060600988513370| 0|
|1473509764|4060600988513370| 1|
|1473513432|4060600988513370| 2|
|1473513432|4060600988513370| 0|
+----------+----------------+--------------------+
In Scala itself this would be fairly simple, but since Spark distributes its dataframes over nodes, I'm not sure how to keep a consistent mapping from A->0, B->1, C->2.
Also, assume the dataframe is pretty big (gigabytes), which means loading an entire column into the memory of a single machine might not be possible.
Can it be done?
Recommended Answer
You can use StringIndexer to encode the letters into indices:
import org.apache.spark.ml.feature.StringIndexer

// Fit an indexer on col3 and write the encoded values to col3Index
val indexer = new StringIndexer()
  .setInputCol("col3")
  .setOutputCol("col3Index")

val indexed = indexer.fit(df).transform(df)
indexed.show()
+----------+----------------+----+---------+
| col1| col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370| A| 0.0|
|1473492972|4060600988513370| A| 0.0|
|1473509764|4060600988513370| B| 1.0|
|1473513432|4060600988513370| C| 2.0|
|1473513432|4060600988513370| A| 0.0|
+----------+----------------+----+---------+
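Note that StringIndexer outputs a Double column and, by default, orders labels by frequency (the most frequent label gets 0.0), so the result is not the plain integer column shown in the question. The fitted StringIndexerModel also exposes the mapping itself via labels (labelsArray in Spark 3.x) if you need A->0, B->1, C->2 explicitly. Below is a minimal sketch of casting the indices back to integers under the original column name; it assumes the indexed dataframe from above, and the factorized name is mine:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Cast the Double indices to integers and put them back under the
// original column name, matching the layout in the question.
val factorized = indexed
  .drop("col3")
  .withColumn("col3Index", col("col3Index").cast(IntegerType))
  .withColumnRenamed("col3Index", "col3")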
The data:
val df = spark.createDataFrame(Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")