Problem description
I'm facing a text classification problem in which I need to classify examples into 34 groups.
The problem is that the training data for the 34 groups is not balanced. For some groups I have 2000+ examples, while for others I have only 100+.
For some small groups, the classification accuracy is quite high; I guess those groups have specific keywords that make them easy to recognize. For others, the accuracy is low, and the predictions always go to the large groups.
I want to know how to deal with this "low-frequency example problem". Would simply duplicating the small groups' data work? Or do I need to select the training data so as to expand and balance the group sizes? Any suggestions?
Recommended answer
Regularization can sometimes help with imbalanced classes by reducing the effect of spurious correlations, but that depends on your data. One solution is simply to over-sample the smaller classes, or to increase the weights of the data points in the smaller classes, which forces the classifier to pay more attention to them.
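As a rough illustration of those two options, here is a minimal sketch in plain Python. The function names `oversample` and `balanced_class_weights` and the toy data are made up for this example, and the weighting formula (total samples divided by the number of classes times the class count) is one common heuristic, not something prescribed above.

```python
import random
from collections import Counter


def oversample(texts, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    has as many examples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing examples with replacement from the class itself.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so
    rarer classes receive proportionally larger weights (hypothetical helper)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy data (assumed): class "a" has 4 examples, class "b" has only 1.
texts = ["t1", "t2", "t3", "t4", "t5"]
labels = ["a", "a", "a", "a", "b"]

new_texts, new_labels = oversample(texts, labels)
weights = balanced_class_weights(labels)
```

If you over-sample, it is usually safer to do so only on the training split, so that duplicated examples do not leak into the data you evaluate on. Many libraries accept per-class or per-sample weights directly; for instance, several scikit-learn classifiers take a `class_weight` argument, which avoids physically duplicating the data.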
You can find more advanced techniques by searching for "class imbalance" problems. However, not many of them have been applied to or created for text classification, because it is very common to have huge amounts of data when working with text. So I'm not sure how many of them work well in such a high-dimensional space.