This article covers how to handle categorical variables with a large number of categories in XGBoost/CatBoost.

Problem description


I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient, as a user interacts with a few hundred items at most, and sometimes as few as 5.


How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
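Since a user touches only a handful of the ~10 000 items, the "items seen" feature can be stored as a sparse multi-hot matrix instead of a dense one-hot encoding; XGBoost accepts scipy sparse matrices directly. A minimal sketch, assuming hypothetical (user_id, item_id) interaction pairs:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical interaction log: (user_id, item_id) pairs; item IDs range over ~10 000.
n_users, n_items = 3, 10_000
interactions = [(0, 5), (0, 42), (1, 5), (1, 9_999), (2, 7), (2, 42), (2, 123)]

rows = [u for u, _ in interactions]
cols = [i for _, i in interactions]
data = np.ones(len(interactions))

# Multi-hot "items seen" feature matrix: one row per user, one column per item.
# Stored sparsely, it costs memory proportional to the number of interactions,
# not n_users * n_items.
X = csr_matrix((data, (rows, cols)), shape=(n_users, n_items))

print(X.shape)  # (3, 10000)
print(X.nnz)    # 7 stored values instead of 30 000 dense entries
```

A matrix like `X` can be passed to `xgboost.DMatrix` (or scikit-learn estimators that support sparse input) without ever materializing the dense encoding.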

Answer


You could also try entity embeddings to reduce hundreds of boolean features into low-dimensional vectors.


It is similar to word embeddings, but for categorical features. In practical terms, you define an embedding of your discrete feature space into a low-dimensional vector space. It can enhance your results and save memory. The downside is that you need to train a neural network model beforehand to define the embedding.
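The idea above can be sketched as a lookup table that maps each item ID to a small learned vector, with a user represented by the average of the vectors of the items they interacted with. The embedding table below is randomly initialized purely for illustration; in practice its weights would come from training a neural network (e.g. an embedding layer) on a related prediction task, as the answer notes:

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, emb_dim = 10_000, 16   # ~10 000 items compressed to 16 dimensions

# Embedding table: one vector per item. Randomly initialized here for
# illustration only; real entity embeddings are learned by a neural network.
embedding = rng.normal(scale=0.1, size=(n_items, emb_dim))

def user_feature(item_ids):
    """Average the embeddings of the items a user interacted with, yielding a
    fixed-size 16-dim feature instead of a 10 000-dim one-hot vector."""
    return embedding[item_ids].mean(axis=0)

feat = user_feature([5, 42, 123])  # a user who interacted with 3 items
print(feat.shape)  # (16,)
```

The resulting 16-dimensional vector can then be fed to XGBoost/CatBoost as ordinary numeric features, which is where the memory savings come from.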

See this article for more information.

