This article describes how to reverse a StringIndexer transformation when the indexed values are nested inside an array in PySpark, which should be a useful reference for anyone hitting the same problem.

Problem description

I'm using PySpark to do collaborative filtering with ALS. My original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark's ALS model requires this).
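For context, a minimal sketch of the indexing and fitting steps described above (the ratings dataframe and its userId, productId, and rating column names are assumptions chosen for illustration):

from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# Hypothetical input: a `ratings` dataframe with string userId/productId columns
# and a numeric rating column.
user_indexer_model = StringIndexer(inputCol="userId", outputCol="userIdIndex").fit(ratings)
product_indexer_model = StringIndexer(inputCol="productId", outputCol="productIdIndex").fit(ratings)

# Replace the string ids with numeric indices, then fit ALS on them.
indexed = product_indexer_model.transform(user_indexer_model.transform(ratings))
model = ALS(userCol="userIdIndex", itemCol="productIdIndex", ratingCol="rating").fit(indexed)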

After fitting the model, I can get the top 3 recommendations for each user like so:

recs = (
    model
    .recommendForAllUsers(3)
)

The recs dataframe looks like this:

+-----------+--------------------+
|userIdIndex|     recommendations|
+-----------+--------------------+
|       1580|[[10096,3.6725707...|
|       4900|[[10096,3.0137873...|
|       5300|[[10096,2.7274625...|
|       6620|[[10096,2.4493625...|
|       7240|[[10096,2.4928937...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- userIdIndex: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productIdIndex: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

I want to create a huge JSON dump from this dataframe, which I can do like so:

(
    recs
    .toJSON()
    .saveAsTextFile("name_i_must_hide.recs")
)

A sample of these JSONs is:

{
  "userIdIndex": 1580,
  "recommendations": [
    {
      "productIdIndex": 10096,
      "rating": 3.6725707
    },
    {
      "productIdIndex": 10141,
      "rating": 3.61542
    },
    {
      "productIdIndex": 11591,
      "rating": 3.536216
    }
  ]
}

The userIdIndex and productIdIndex keys are there because of the StringIndexer transformations.

How can I get the original values of these columns back? I suspect I must use the IndexToString transformer, but I can't quite figure out how, since the data is nested in an array inside the recs dataframe.

I tried to use a Pipeline (stages=[StringIndexer, ALS, IndexToString]), but it looks like this estimator doesn't support these indexers.

Cheers!

Recommended answer

In both cases you'll need access to the list of labels. This can be obtained using either a StringIndexerModel:

user_indexer_model = ...  # type: StringIndexerModel
user_labels = user_indexer_model.labels

product_indexer_model = ...  # type: StringIndexerModel
product_labels = product_indexer_model.labels

or from the column metadata.
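A minimal sketch of the column-metadata route, assuming the indexed training dataframe (called indexed here for illustration) still carries the nominal-attribute metadata that StringIndexer attaches to its output columns:

# StringIndexer stores its labels under the "ml_attr" key of the output column's metadata.
user_labels = indexed.schema["userIdIndex"].metadata["ml_attr"]["vals"]
product_labels = indexed.schema["productIdIndex"].metadata["ml_attr"]["vals"]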

For userIdIndex just apply IndexToString:

from pyspark.ml.feature import IndexToString

user_id_to_label = IndexToString(
    inputCol="userIdIndex", outputCol="userId", labels=user_labels)
user_id_to_label.transform(recs)

For recommendations you'll need either a udf or an expression like this:

from pyspark.sql.functions import array, col, lit, struct

n = 3  # Same as numItems passed to recommendForAllUsers

# An array-literal column of product labels, so a numeric index can be used to
# look up the original string id.
product_labels_ = array(*[lit(x) for x in product_labels])

# Rebuild the recommendations array, replacing each productIdIndex with its label.
recommendations = array(*[struct(
    product_labels_[col("recommendations")[i]["productIdIndex"]].alias("productId"),
    col("recommendations")[i]["rating"].alias("rating")
) for i in range(n)])

recs.withColumn("recommendations", recommendations)
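For the udf alternative mentioned above, here is a sketch reusing product_labels (the productId field name is an assumption, chosen to match the expression version):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

# The return type mirrors the recommendations array, with the original string id.
labelled_recs_type = ArrayType(StructType([
    StructField("productId", StringType()),
    StructField("rating", FloatType())
]))

@udf(labelled_recs_type)
def label_recommendations(recommendations):
    # Look each numeric index up in the captured label list.
    return [(product_labels[r.productIdIndex], r.rating) for r in recommendations]

recs.withColumn("recommendations", label_recommendations("recommendations"))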

That concludes this article on reversing a StringIndexer inside a nested array in PySpark; hopefully the answer above helps.
