Problem description
I'm working on a Spark MLlib algorithm. The dataset I have is in this form:
"Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth":. (there are more values similar to these)
I'm trying to encode the String values as numeric values, so I tried using zipWithUniqueId to get a unique value for each of the string values. For some reason I'm not able to save the modified dataset to disk. Can I do this in some way using Spark SQL? Or what would be a better approach?
Recommended answer
Scala
import org.apache.spark.sql.functions.monotonically_increasing_id
val dataFrame1 = dataFrame0.withColumn("index", monotonically_increasing_id())
Java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
Dataset<Row> dataFrame1 = dataFrame0.withColumn("index", functions.monotonically_increasing_id());
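One caveat worth keeping in mind: as far as the Spark documentation describes it, monotonically_increasing_id() produces IDs that are unique and increasing but not consecutive. The current implementation is documented as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Java sketch below only illustrates that layout and does not call Spark; the class and method names (MonotonicIdLayout, idFor) are hypothetical.

```java
// Illustration only: mimics the documented bit layout of the generated IDs.
// MonotonicIdLayout and idFor are hypothetical names, not part of Spark.
class MonotonicIdLayout {

    // Upper bits carry the partition ID, lower 33 bits the row number
    // within that partition (assumed layout, per the Spark docs).
    static long idFor(int partitionId, long rowInPartition) {
        return ((long) partitionId << 33) | rowInPartition;
    }

    public static void main(String[] args) {
        // Rows in partition 0 get small consecutive values.
        System.out.println(idFor(0, 0)); // 0
        System.out.println(idFor(0, 1)); // 1
        // The first row of partition 1 jumps to 2^33.
        System.out.println(idFor(1, 0)); // 8589934592
    }
}
```

So the "index" column is suitable as a unique row key, but it should not be assumed to run 0, 1, 2, ... across the whole table once the data spans more than one partition.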