This article covers how to approach binary semi-supervised classification when the data set contains only positive and unlabeled examples.

Problem description

My data consist of comments (saved in files), and only a few of them are labelled as positive. I would like to use semi-supervised and PU (positive-unlabeled) classification to classify these comments into positive and negative classes. Is there any public implementation of semi-supervised and PU learning in Python (scikit-learn)?

Recommended answer

You could try to train a one-class SVM and see what kind of results that gives you. I haven't heard about the PU paper. I think for all practical purposes you will be much better off labelling some points and then using semi-supervised methods.

If finding negative points is hard, I would try to use heuristics to find putative negative points (which I think is similar to the techniques in the PU paper). You could either classify unlabelled vs. positive and then only look at the ones that score strongly for unlabelled, or learn a one-class SVM or similar and then look for negative points in the outliers.
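
The two heuristics for finding putative negatives could look roughly like the sketch below. It is only an illustration under several assumptions that are not part of the answer: the comments are invented toy data, TF-IDF is used as the text representation, logistic regression stands in for the "unlabelled vs. positive" classifier, and the `nu` value and variable names are arbitrary choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

# Toy data, invented purely for illustration.
positive_comments = [
    "great product, works really well",
    "excellent service and great quality",
    "really great, I love this product",
    "love the quality, excellent value",
]
unlabeled_comments = [
    "great value, I love it",
    "terrible, it broke after one day",
    "awful support, total waste of money",
    "excellent product, works great",
    "broke immediately, terrible quality",
]

# Assumption: TF-IDF features; any text representation could be used instead.
vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(positive_comments + unlabeled_comments)
n_pos = len(positive_comments)
X_pos, X_unl = X_all[:n_pos], X_all[n_pos:]

# Heuristic 1: train a classifier to separate positive (1) from unlabeled (0),
# then rank the unlabeled comments by how strongly they score as "unlabeled".
y = np.array([1] * n_pos + [0] * len(unlabeled_comments))
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_all, y)
p_unlabeled = clf.predict_proba(X_unl)[:, 0]  # column 0 is class 0 (unlabeled)
ranked_by_clf = np.argsort(-p_unlabeled)      # most "unlabeled-like" first

# Heuristic 2: fit a one-class SVM on the positives only, then treat the
# unlabeled comments with the lowest decision scores (outliers) as candidates.
ocsvm = OneClassSVM(kernel="linear", nu=0.5).fit(X_pos)
outlier_scores = ocsvm.decision_function(X_unl)  # lower means more outlying
ranked_by_svm = np.argsort(outlier_scores)       # most outlying first

# Show the top candidates from each heuristic for manual inspection.
print("Putative negatives (positive-vs-unlabeled classifier):")
for i in ranked_by_clf[:2]:
    print("  ", unlabeled_comments[i])
print("Putative negatives (one-class SVM outliers):")
for i in ranked_by_svm[:2]:
    print("  ", unlabeled_comments[i])
```

Both rankings only suggest candidates; the top-ranked comments still need to be checked by hand before they are treated as negatives. On real data it would probably be safer to score the unlabeled comments out of sample (for example with cross-validated predictions) rather than on the data the classifier was fitted to, as done here for brevity.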

If you are interested in actually solving the task, I would much rather invest time in manual labelling than in implementing fancy methods.

10-26 20:10