scala - How to set the preference for ALS implicit feedback in collaborative filtering?

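I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. The input data has only two fields, userId and productId. I have no product ratings, only information about which products each user has bought. So, to train ALS, I use: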

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel

(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)

This API requires Rating objects:

Rating(user: Int, product: Int, rating: Double)

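On the other hand, the documentation for trainImplicit says: "Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by users to some products, in the form of (userID, productID, preference) pairs."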

When I set the rating/preference to 1, as in:

val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()

and then train ALS:

val model = ALS.trainImplicit(training, rank, numIter)

I get an RMSE of 0.9, which is a large error given that preferences only take the values 0 or 1:

val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}
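The formula inside computeRmse can be checked on a small local collection without Spark. The sketch below is only an illustration: the object name LocalRmse and the sample numbers are made up, not taken from the post. It shows that when every held-out rating is 1.0, the RMSE simply measures how far the predicted scores sit from 1. For instance, predictions that all equal 0.25 give an RMSE of 0.75.

```scala
// Hypothetical helper for illustration; not part of Spark or the original script.
object LocalRmse {
  /** RMSE over (prediction, observed rating) pairs. */
  def rmse(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (p, r) => (p - r) * (p - r) }.sum / pairs.size)

  def main(args: Array[String]): Unit = {
    // Made-up predictions scored against implicit "ratings" that are all 1.0.
    val pairs = Seq((0.25, 1.0), (0.25, 1.0))
    println(rmse(pairs)) // 0.75
  }
}
```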

So my question is: what value should I set for rating in:

Rating(user: Int, product: Int, rating: Double)

for implicit training (in the ALS.trainImplicit method)?

Update

With:

  val alpha = 40
  val lambda = 0.01

I get:

Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.

I guess this is still a large error. I also get a strange baseline improvement, where the baseline model is simply the mean (1).
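The -Infinity% figure is what floating-point division by zero would produce. The post does not show how the improvement is computed, so the formula below is an assumption (relative improvement over the baseline, in percent), and the object and method names are hypothetical. Because every observed rating is 1.0, predicting the mean is exact and the baseline RMSE is 0.0.

```scala
// Assumed reconstruction of the improvement calculation; the original
// script's formula is not shown in the post.
object BaselineImprovement {
  /** Relative improvement of the model over the baseline, in percent. */
  def improvementPercent(baselineRmse: Double, testRmse: Double): Double =
    (baselineRmse - testRmse) / baselineRmse * 100

  def main(args: Array[String]): Unit = {
    val baselineRmse = 0.0               // mean predictor is exact when all ratings are 1.0
    val testRmse = 0.7302343904091481    // value reported in the log above
    println(improvementPercent(baselineRmse, testRmse)) // -Infinity
  }
}
```

Under that assumption, the percentage is not meaningful for data whose ratings are all identical, since any nonzero model error is divided by a zero baseline.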

Best answer

You can specify the alpha confidence level. The default is 1.0, but try a lower value.

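val alpha = 0.01
val lambda = 0.01 // regularization parameter, same value as in the question's update
// lambda comes before alpha in this overload
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)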

Let us know how that goes.
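To see why alpha matters when every rating is 1.0, here is a minimal sketch of how the implicit-feedback formulation referenced in the MLlib documentation (Hu, Koren and Volinsky) interprets a raw value r. The value becomes a binary preference plus a confidence weight of 1 + alpha * r. The object ImplicitWeights and its method names are hypothetical and are not Spark APIs; this only illustrates the weighting, not the solver.

```scala
// Illustrative only: ImplicitWeights is a made-up name, not part of Spark.
object ImplicitWeights {
  /** Binary preference: 1.0 if an interaction was observed (r > 0), else 0.0. */
  def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

  /** Confidence in that preference; grows linearly with the raw value r. */
  def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * r

  def main(args: Array[String]): Unit = {
    val r = 1.0 // each purchase recorded as Rating(user, product, 1.0)
    for (alpha <- Seq(0.01, 1.0, 40.0)) {
      println(s"alpha=$alpha preference=${preference(r)} confidence=${confidence(r, alpha)}")
    }
  }
}
```

With r fixed at 1.0, alpha is the only setting that controls how much more an observed purchase counts than an unobserved user/product pair, which has r = 0 and therefore confidence 1. In this formulation the model's predictions approximate the 0/1 preference rather than a rating scale.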