


I am trying to debug an issue with my classifier: it always predicts the same class, whatever the input, despite having close to 80% accuracy.

I trained my CNN to distinguish between 2 classes. Class A has 2575 jpegs and class B has 665 jpegs.

Could this have caused my CNN to always predict the same class? Is this too much of an imbalance between the number of items in each class? In general, will my performance improve if I make both classes the same size (665 jpegs each)?



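This looks like a case of class imbalance. With 2575 of your 3240 images in class A, a model that always predicts A already scores about 79.5% accuracy, which matches what you are seeing. There are different ways to handle it: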

  1. Weighted loss: You can reduce the majority class's contribution to the loss (or increase the minority class's) by computing a weighted cross entropy.
  2. Resampling the data: As you mentioned, you can downsample the majority class to balance the classes. You can also upsample the minority class to even them out.
  3. Generate augmented data: Since you are working with images, you can upsample the minority class and then apply data augmentation to those images. This addresses the class imbalance and also reduces overfitting and improves generalisation.
  4. A combination of the above.
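
To make options 1 and 2 concrete, here is a minimal sketch in plain Python using the class counts from the question. The inverse-frequency weighting formula and the helper names (`weighted_cross_entropy`, `oversample_minority`) are my own assumptions for illustration, not something from your code or from a particular library. In practice you would pass the weights to your framework's loss function rather than compute the loss by hand.

```python
import math
import random

# Class counts taken from the question.
counts = {"A": 2575, "B": 665}
total = sum(counts.values())

# Accuracy of a model that always predicts the majority class (about 0.795).
majority_baseline = max(counts.values()) / total

# Assumed weighting scheme: inverse frequency, total / (n_classes * count).
# Each class then contributes the same total weight to the loss.
class_weights = {c: total / (len(counts) * n) for c, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights.

    probs  : list of dicts mapping class name -> predicted probability
    labels : list of true class names
    """
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


def oversample_minority(labels, seed=0):
    """Return sample indices in which smaller classes are repeated at random
    (with replacement) until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(idx) for idx in by_class.values())
    out = []
    for idx in by_class.values():
        out.extend(idx)                                    # keep every original sample
        out.extend(rng.choices(idx, k=target - len(idx)))  # pad with random repeats
    rng.shuffle(out)
    return out
```

With these counts class B gets roughly four times the weight of class A, so a mistake on a B image costs about as much as four mistakes on A images. If you oversample, do it on the training split only, and combine it with augmentation (option 3) so the repeated B images are not exact copies.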


08-20 10:10