How to calculate the similarity between two users on Twitter

Problem description

I am working on a data mining project. My company has given me dummy Twitter data for 6 million customers, and I have been asked to find the similarity between any two users. Can anyone give me some ideas on how to deal with community data at this scale? Thanks in advance.

Problem: I use tweets and hashtags (hashtags are the words a user highlights) as the two criteria for measuring the similarity between two users. The number of users is large, and each user may have millions of hashtags and tweets. Can anyone suggest a fast way to calculate the similarity between two users? I tried TF-IDF, but it seems infeasible at this scale. Does anyone have an efficient algorithm or a good idea for quickly finding all the similarities between users?

For example:
user A's hashtags = {cat, bull, cow, chicken, duck}
user B's hashtags = {cat, chicken, cloth}
user C's hashtags = {lenovo, Hp, Sony}

Clearly, C has no relation to A, so computing their similarity is a waste of time; we could filter out all unrelated users before calculating any similarity. In fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as a criterion to quickly find the group of users potentially similar to A? Is this a good idea, or should we just directly calculate the similarity between A and every other user? Which algorithm would be fastest and best suited to this problem?
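One way to picture the pre-filtering idea described in the question is an inverted index from hashtag to users: only users who share at least one hashtag with A are ever scored, so a user like C is never touched. The sketch below is an illustration and not part of the original post. The function names (`build_index`, `candidates`, `jaccard`) are hypothetical, and Jaccard similarity is an assumed choice of score, since the question does not name one.

```python
from collections import Counter, defaultdict

def build_index(user_tags):
    """Map each hashtag to the set of users who used it (an inverted index)."""
    index = defaultdict(set)
    for user, tags in user_tags.items():
        for tag in tags:
            index[tag].add(user)
    return index

def candidates(user, user_tags, index, min_shared=1):
    """Count shared hashtags per other user, visiting only users that share a tag."""
    shared = Counter()
    for tag in user_tags[user]:
        for other in index[tag]:
            if other != user:
                shared[other] += 1
    return {u: n for u, n in shared.items() if n >= min_shared}

def jaccard(a, b):
    """|a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# The example hashtag sets from the question.
user_tags = {
    "A": {"cat", "bull", "cow", "chicken", "duck"},
    "B": {"cat", "chicken", "cloth"},
    "C": {"lenovo", "Hp", "Sony"},
}
index = build_index(user_tags)
shared_with_a = candidates("A", user_tags, index)  # C shares no tag, so it is never visited
scores = {u: jaccard(user_tags["A"], user_tags[u]) for u in shared_with_a}
```

With this layout the cost of finding A's candidates depends on how many users share A's hashtags, not on the total number of users. Very popular hashtags would still produce long user lists, so a real system would probably need to cap or down-weight them.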

Recommended answer


SELECT UserID,
       COUNT(HashtagID) AS SharedTags
FROM UserHashtag
WHERE HashtagID IN (
                    SELECT HashtagID
                    FROM UserHashtag
                    WHERE UserID = 1
                   )             -- Get all the tags that belong to this user.
  AND UserID != 1                -- Don't match the current user.
GROUP BY UserID                  -- One row per other user.
HAVING COUNT(HashtagID) > 2      -- Keep users with 3 or more matching tags.
ORDER BY COUNT(HashtagID) DESC   -- Most shared tags first.
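To try a query of this shape on the question's example data, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The two-column layout of `UserHashtag` is inferred from the query, the integer IDs assigned to users and hashtags are made up for illustration, and the match threshold is passed as a parameter and lowered to 2 because A and B share only two hashtags.

```python
import sqlite3

# Hypothetical integer IDs for the question's example hashtags.
tag_ids = {"cat": 1, "bull": 2, "cow": 3, "chicken": 4, "duck": 5,
           "cloth": 6, "lenovo": 7, "Hp": 8, "Sony": 9}
user_tags = {
    1: ["cat", "bull", "cow", "chicken", "duck"],  # user A
    2: ["cat", "chicken", "cloth"],                # user B
    3: ["lenovo", "Hp", "Sony"],                   # user C
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserHashtag ("
    "UserID INTEGER, HashtagID INTEGER, PRIMARY KEY (UserID, HashtagID))"
)
conn.executemany(
    "INSERT INTO UserHashtag VALUES (?, ?)",
    [(u, tag_ids[t]) for u, tags in user_tags.items() for t in tags],
)

# Count, per other user, how many of the target user's hashtags they share.
query = """
SELECT UserID, COUNT(HashtagID) AS SharedTags
FROM UserHashtag
WHERE HashtagID IN (SELECT HashtagID FROM UserHashtag WHERE UserID = ?)
  AND UserID != ?
GROUP BY UserID
HAVING COUNT(HashtagID) >= ?
ORDER BY SharedTags DESC
"""
rows = conn.execute(query, (1, 1, 2)).fetchall()  # target user 1, at least 2 shared tags
conn.close()
```

The primary key on (UserID, HashtagID) keeps each user-tag pair unique, so `COUNT(HashtagID)` counts distinct shared tags. On a table with millions of rows, an additional index on HashtagID would likely be needed for the subquery filter to stay fast.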





Good luck!

Hogan



