I have a DataFrame df with the following structure:
Input
amount id
13000 1
30000 2
10000 3
5000 4
I want to create a new column based on the quantiles of the amount column.
Expected output:
amount id amount_bin
13000 1 10000
30000 2 15000
10000 3 10000
5000 4 5000
This assumes the 0.25, 0.5, and 0.75 quantiles are 5000, 10000, and 15000, respectively.
I know how to do this in R:
quantile <- quantile(df$amount, probs = c(0, 0.25, 0.50, 0.75, 1.0), na.rm = TRUE,
                     names = FALSE)
df$amount_bin <- cut(df$amount, breaks = quantile, include.lowest = TRUE,
                     labels = c(quantile[2], quantile[3], quantile[4], quantile[5]))
Best answer
You can use QuantileDiscretizer from the ML library.
Create buckets based on the fitted quantiles:
import org.apache.spark.ml.feature.QuantileDiscretizer
val data = Array((13000, 1), (30000, 2), (10000, 3), (5000, 4))
val df = spark.createDataFrame(data).toDF("amount", "id")
val discretizer = new QuantileDiscretizer()
  .setInputCol("amount")
  .setOutputCol("result")
  .setNumBuckets(4)
val result = discretizer.fit(df).transform(df)
result.show()
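Note that QuantileDiscretizer writes a bucket index (0.0, 1.0, ...) to the output column rather than a quantile value. If you want the bin labelled with a split point, as in the expected output of the question, one possible approach is to read the splits back from the fitted model and map each index to the lower bound of its bucket. The following is only a sketch: the column names bucket and amount_bin, the choice of the lower bound as the label, and the use of the smallest amount for the lowest bucket (which has no finite lower bound) are assumptions made for illustration. The fitted splits are approximate and, on a dataset this small, need not match the quantiles assumed in the question.

```scala
import org.apache.spark.ml.feature.{Bucketizer, QuantileDiscretizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min, udf}

// Reuses the active session in spark-shell, or starts a local one otherwise.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("quantile-bins")
  .getOrCreate()

val df = spark.createDataFrame(Seq((13000, 1), (30000, 2), (10000, 3), (5000, 4)))
  .toDF("amount", "id")

// Fitting a QuantileDiscretizer returns a Bucketizer that carries the split points.
val model: Bucketizer = new QuantileDiscretizer()
  .setInputCol("amount")
  .setOutputCol("bucket")
  .setNumBuckets(4)
  .fit(df)

// getSplits is ordered as (-Infinity, q1, q2, ..., +Infinity);
// dropping the last entry leaves the lower bound of each bucket.
val lowerBounds: Array[Double] = model.getSplits.init

// Assumption: the lowest bucket has no finite lower bound,
// so it is labelled with the smallest observed amount instead.
val minAmount = df.agg(min(col("amount").cast("double"))).first().getDouble(0)
val labels: Array[Double] = lowerBounds.map(b => if (b.isNegInfinity) minAmount else b)

// Translate the bucket index into its label.
val toLabel = udf((bucket: Double) => labels(bucket.toInt))

val labeled = model.transform(df)
  .withColumn("amount_bin", toLabel(col("bucket")))
  .drop("bucket")

labeled.show()
```

Because each bucket is the half-open interval from its lower split up to the next split, every label produced this way is less than or equal to the amount it is attached to. To label by the upper bound instead, as the R code in the question does, you would take model.getSplits.tail and pick a finite stand-in for +Infinity.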