I have a DataFrame df with the following structure:

Input:

amount id
13000  1
30000  2
10000  3
5000   4

I want to create a new column based on the quantiles of the amount column.

Expected output:
amount id amount_bin
13000  1  10000
30000  2  15000
10000  3  10000
5000   4  5000

Assume the 0.25, 0.5, and 0.75 quantiles are 5000, 10000, and 15000, respectively.

I know how to do this in R:
quantile <- quantile(df$amount, probs = c(0, 0.25, 0.50, 0.75, 1.0), na.rm = TRUE,
                     names = FALSE)

df$amount_bin <- cut(df$amount, breaks = quantile, include.lowest = TRUE,
                     labels = c(quantile[2], quantile[3], quantile[4], quantile[5]))

Best answer

You can use QuantileDiscretizer from the ML library.

It creates buckets based on the fitted quantiles:

import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Array((13000, 1), (30000, 2), (10000, 3), (5000, 4))
val df = spark.createDataFrame(data).toDF("amount", "id")

val discretizer = new QuantileDiscretizer()
  .setInputCol("amount")
  .setOutputCol("result")
  .setNumBuckets(4)

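// fit() learns the split points and returns a Bucketizer; transform() applies it
val result = discretizer.fit(df).transform(df)
// The "result" column holds the bucket index as a Double (0.0 up to 3.0), not a quantile value
result.show()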
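
The result column above holds bucket indices, while the question asks for a quantile value in amount_bin. One way to get closer to the R snippet is to compute the break points with approxQuantile and label each row with the upper edge of its interval. This is only a sketch under assumptions: the toBin UDF, the breaks and upperEdges names, and the local SparkSession setup are illustrative and not part of the original answer. Spark's quantile estimate also does not interpolate the way R's default quantile() does, so the labels may differ from R's.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("amount-bin-sketch") // hypothetical app name
  .getOrCreate()

val df = spark.createDataFrame(Seq((13000, 1), (30000, 2), (10000, 3), (5000, 4)))
  .toDF("amount", "id")

// Break points at the same probabilities as the R code.
// relativeError = 0.0 asks for exact quantiles, which can be expensive on large data.
val breaks: Array[Double] =
  df.stat.approxQuantile("amount", Array(0.0, 0.25, 0.5, 0.75, 1.0), 0.0)

// Drop the minimum so each remaining value is the upper edge of one interval,
// like labels = c(quantile[2], ..., quantile[5]) in the R version.
val upperEdges: Array[Double] = breaks.tail

// Hypothetical helper: pick the first upper edge that is >= amount,
// which corresponds to right-closed intervals (a, b] with the lowest value included.
val toBin = udf { (amount: Double) =>
  upperEdges.find(amount <= _).getOrElse(upperEdges.last)
}

val binned = df.withColumn("amount_bin", toBin(col("amount").cast("double")))
binned.show()
```

A UDF keeps the sketch short. On larger data, a chain of when/otherwise expressions built from the same upperEdges array would avoid the UDF overhead.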
