Problem description
I'd like to cluster users based on the categories or tags of the shows they watch. What's the easiest or best algorithm for this?
Assuming I have around 20,000 tags and several million watch events to use as signals, is there an algorithm I can implement using, say, Pig/Hadoop/Mortar, or perhaps on Neo4j?
In terms of data, I have users, the programs they've watched, and the tags each program has (usually around 10 tags per program).
At the end I would like to have k clusters (maybe a dozen?) or broad buckets into which I can classify and group my users, and also gain some insight into how they are divided, with a set of tags representing each cluster.
I've seen some posts suggesting a hierarchical algorithm, but I'm not sure how one would calculate "distance" in that case. Would that be a distance between two users, or between a user and a set of tags, or something else?
Recommended answer
You basically want to cluster the users according to their tags.
To keep it simple, assume that you have only 10 tags (instead of 20,000). Assume that a user, say user_34, has the 2nd and 7th tags. For this clustering task, user_34 can be represented as a point in 10-dimensional space with coordinates [0,1,0,0,0,0,1,0,0,0].
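The encoding described above can be sketched in a few lines of Python. This is only an illustration: the helper name `encode_user` and its parameters are made up for this example and are not part of any library.

```python
def encode_user(tag_positions, num_tags):
    """Return a 0/1 list with a 1 at each (1-based) tag position the user has."""
    vector = [0] * num_tags
    for position in tag_positions:
        # Tag positions are 1-based in the text; list indices are 0-based.
        vector[position - 1] = 1
    return vector


# user_34 has the 2nd and 7th of 10 tags.
user_34 = encode_user([2, 7], num_tags=10)
print(user_34)  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```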
In your own case, each user can similarly be represented as a point in a 20,000-dimensional space. You can use Apache Mahout, which contains many effective clustering algorithms, such as K-means.
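To show what K-means does with such points, here is a minimal pure-Python sketch. It is not Mahout's implementation and will not scale to millions of users. The function names, the tiny 4-tag data set, and the choice of passing the initial centroids explicitly (to keep the run deterministic) are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centroids, max_iterations=100):
    """Plain K-means: alternate assignment and centroid update until stable."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Four hypothetical users over four tags (1 = user has the tag).
users = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
assignments, centroids = kmeans(users, initial_centroids=[users[0], users[2]])
print(assignments)  # [0, 0, 1, 1]
```

The centroid of each cluster is the fraction of its users holding each tag, which already hints at which tags characterize the cluster.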
Since everything is well defined in a mathematical coordinate system, computing the distance between any two users is easy. Any distance function can be used, but the Euclidean distance is the de facto standard.
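As a small worked example of the distance between two users, the sketch below compares user_34 with a second user. `user_35` and the function name are hypothetical, introduced only for this illustration.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


user_34 = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]  # tags 2 and 7
user_35 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]  # hypothetical user with tags 2 and 9

# The vectors differ in two positions (tags 7 and 9), so the distance is sqrt(2).
distance = euclidean_distance(user_34, user_35)
print(distance)
```

For 0/1 vectors like these, the squared Euclidean distance is simply the number of tags that exactly one of the two users has.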
Note: Mahout and many other data-mining programs support formats suited to SPARSE features, i.e. you do not need to write out ...,0,0,0,0,... in the file, but only need to specify which tags are selected. (See RandomAccessSparseVector in Mahout.)
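The idea behind a sparse representation can be sketched without any library: store only the indices of the tags a user has. The code below is a plain-Python illustration, not Mahout's RandomAccessSparseVector API, and the helper names are assumptions.

```python
import math


def to_dense(tag_indices, num_tags):
    """Expand a set of 0-based tag indices into a full 0/1 list."""
    return [1 if i in tag_indices else 0 for i in range(num_tags)]


def sparse_euclidean(tags_a, tags_b):
    """Euclidean distance between two 0/1 vectors stored as index sets.

    Each tag held by exactly one of the two users contributes 1 to the
    squared distance, so only the symmetric difference matters.
    """
    return math.sqrt(len(tags_a ^ tags_b))


# Only two entries are stored per user, regardless of the total tag count.
user_34 = {1, 6}  # 0-based indices of the 2nd and 7th tags
user_35 = {1, 8}  # hypothetical user with the 2nd and 9th tags

print(to_dense(user_34, num_tags=10))  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
print(sparse_euclidean(user_34, user_35))
```

With 20,000 tags and a handful of tags per user, this stores a few integers per user instead of 20,000 mostly zero values.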
Note: I assumed you only want to cluster your users. Extracting representative information from the clusters is somewhat tricky. For example, for each cluster you may select the tags that are most common among the users of that cluster. Alternatively, you may use concepts from information theory, such as information gain, to find out which tags carry the most information about the cluster.
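Both suggestions can be sketched roughly as follows. The tag names, the toy data, and the function names are invented for illustration. Information gain is computed here for the binary question "is the user in this cluster or not", which is one reasonable reading of the idea rather than the only one.

```python
import math
from collections import Counter


def top_tags(cluster_users, n=2):
    """Return the n most frequent tags among the users of one cluster."""
    counts = Counter(tag for tags in cluster_users for tag in tags)
    return [tag for tag, _ in counts.most_common(n)]


def entropy(positives, total):
    """Binary entropy (in bits) of a group with `positives` members in the cluster."""
    if total == 0 or positives in (0, total):
        return 0.0
    p = positives / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(users, in_cluster, tag):
    """Reduction in entropy of cluster membership after splitting users on `tag`.

    users: list of tag sets; in_cluster: parallel list of booleans.
    """
    total = len(users)
    with_tag = [c for tags, c in zip(users, in_cluster) if tag in tags]
    without_tag = [c for tags, c in zip(users, in_cluster) if tag not in tags]
    before = entropy(sum(in_cluster), total)
    after = sum(
        len(group) / total * entropy(sum(group), len(group))
        for group in (with_tag, without_tag)
    )
    return before - after


# Hypothetical users and a hypothetical clustering result.
users = [
    {"comedy", "sitcom"},
    {"comedy", "drama"},
    {"sports", "news"},
    {"sports", "drama"},
]
in_cluster = [True, True, False, False]  # membership in the first cluster

cluster_0 = [tags for tags, member in zip(users, in_cluster) if member]
print(top_tags(cluster_0, n=1))                       # ['comedy']
print(information_gain(users, in_cluster, "comedy"))  # 1.0
print(information_gain(users, in_cluster, "drama"))   # 0.0
```

In this toy data "comedy" separates the cluster perfectly, while "drama" appears equally inside and outside it and so says nothing about membership.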