Problem description
I have become part of a school project that has been a lot of fun so far, and it just got a little more interesting. I have roughly 600,000 tweets (each contains the screen name, geolocation, text, etc.), and my goal is to classify each user as either male or female. Using Twitter4J I can get the user's full name, number of friends, retweets, etc. So I was wondering whether a combination of looking at a user's name and doing text analysis would be a possible answer. I was originally thinking of a rule-based classifier that first looks at the user's name, then analyzes their text, and attempts to come to a conclusion of M or F. I'm guessing I would have trouble using something such as Naive Bayes, since I don't have the ground-truth values?
As for the names, I would check some kind of dictionary to decide whether a name is male or female. I know there are cases where it's hard to tell, but that's why I'd be looking at their tweet texts as well. I also forgot to mention: with these 600,000 tweets, I have at least two tweets per user available.
Any ideas or input on classifying a user's gender would be greatly appreciated! I don't have a ton of experience in this area and I'm looking to learn anything I can get my hands on.
Recommended answer
Any supervised learning algorithm, such as Naive Bayes, requires a training set. Without the actual gender for at least some of the data, you cannot build such a model. On the other hand, if you come up with a rule-based system (like one based on the users' names), you can try a semi-supervised approach. Using your rule-based system, you can create a labelling of your data. Say your rule-based classifier is RC and it can answer "Male", "Female", or "Do not know"; then you can label your data X using RC in a natural way:
X_m = { x in X : RC(x)="Male" }
X_f = { x in X : RC(x)="Female" }
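The labelling step can be sketched in Python as follows. This is only an illustration under stated assumptions: the name sets, the `rc` function, and the toy users are all invented here, and a real RC would rely on a much larger name dictionary.

```python
# Hypothetical name dictionary; a real one would be far larger.
MALE_NAMES = {"john", "michael", "david"}
FEMALE_NAMES = {"mary", "anna", "linda"}


def rc(full_name):
    """Rule-based classifier RC: answers only when the first name is unambiguous."""
    parts = full_name.strip().lower().split()
    first = parts[0] if parts else ""
    if first in MALE_NAMES and first not in FEMALE_NAMES:
        return "Male"
    if first in FEMALE_NAMES and first not in MALE_NAMES:
        return "Female"
    return "Do not know"


# Toy data X: (full name, tweet text) pairs, invented for illustration.
X = [
    ("John Smith", "chess training again today"),
    ("Mary Jones", "hiking trip was great"),
    ("xX_dragon_Xx", "stuck in traffic"),
]

# The two labelled subsets, plus the users RC cannot decide on.
X_m = [x for x in X if rc(x[0]) == "Male"]
X_f = [x for x in X if rc(x[0]) == "Female"]
X_unknown = [x for x in X if rc(x[0]) == "Do not know"]
```

The users in `X_unknown` are exactly the ones a second, learned classifier would later have to handle.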
Once you have done this, you can create a training set for the supervised learning model using all your data except what was used to build RC, in this case the users' names (I assume that RC answers "Male" or "Female" iff it is entirely "sure" about it). As a result, you will train a classifier that tries to generalize the concept of gender from all the additional data (words used, location, etc.). Let's call it SC. After that, you can simply create a "complex" classifier:
C(x) = "Male"   iff RC(x)="Male" or
                    (RC(x)="Do not know" && SC(x)="Male")
       "Female" iff RC(x)="Female" or
                    (RC(x)="Do not know" && SC(x)="Female")
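As a rough sketch of how RC, SC and C could fit together, here is a small pure-Python version. Everything in it is an assumption made for illustration: the name sets, the toy users, and the choice of a hand-written multinomial Naive Bayes with Laplace smoothing as SC (any supervised text classifier could take its place). Note that SC sees only the tweet text, never the name.

```python
import math
from collections import Counter

MALE_NAMES = {"john", "david"}      # hypothetical name dictionary
FEMALE_NAMES = {"mary", "linda"}


def rc(name):
    """Rule-based classifier RC, using the first name only."""
    parts = name.lower().split()
    first = parts[0] if parts else ""
    if first in MALE_NAMES:
        return "Male"
    if first in FEMALE_NAMES:
        return "Female"
    return "Do not know"


def train_sc(labelled):
    """Fit a multinomial Naive Bayes model on (text, label) pairs.

    Assumes both labels occur at least once in the training data.
    """
    word_counts = {"Male": Counter(), "Female": Counter()}
    doc_counts = Counter()
    for text, label in labelled:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["Male"]) | set(word_counts["Female"])
    return word_counts, doc_counts, vocab


def sc(model, text):
    """Return the label with the higher log-posterior; unseen words are skipped."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("Male", "Female"):
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy users (name, tweet text), invented for illustration.
users = [
    ("John Smith", "chess training again today"),
    ("David Lee", "chess match was great"),
    ("Mary Jones", "hiking class again today"),
    ("Linda Brown", "hiking trip was great"),
]

# Train SC on the text of the users RC is sure about; names are left out.
labelled = [(text, rc(name)) for name, text in users if rc(name) != "Do not know"]
model = train_sc(labelled)


def c(name, text):
    """Combined classifier C: trust RC when it answers, otherwise fall back to SC."""
    answer = rc(name)
    return answer if answer != "Do not know" else sc(model, text)
```

With this toy data, `c("xX_dragon_Xx", "chess chess tonight")` falls through to SC, while `c("Mary Jones", "chess chess")` is decided by RC alone, whatever the text says.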
This way you can, on the one hand, use the most valuable information (the user's name) in a rule-based way, and at the same time exploit the power of supervised learning for the "hard cases", even though you had no "ground truth" in the first place.