Problem description
I'm using TensorFlow and I modified the tutorial example to take my RGB images.
The algorithm works flawlessly out of the box on the new image set until, suddenly (while still converging, usually at around 92% accuracy), it crashes with an error saying that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until, very suddenly and for no apparent reason, the error is thrown. Adding
print("max W vales: %g %g %g %g" % (tf.reduce_max(tf.abs(W_conv1)).eval(), tf.reduce_max(tf.abs(W_conv2)).eval(), tf.reduce_max(tf.abs(W_fc1)).eval(), tf.reduce_max(tf.abs(W_fc2)).eval()))
print("max b vales: %g %g %g %g" % (tf.reduce_max(tf.abs(b_conv1)).eval(), tf.reduce_max(tf.abs(b_conv2)).eval(), tf.reduce_max(tf.abs(b_fc1)).eval(), tf.reduce_max(tf.abs(b_fc2)).eval()))
as debug code to each loop iteration yields the following output:
Step 8600
max W vales: 0.759422 0.295087 0.344725 0.583884
max b vales: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W vales: 0.75947 0.295084 0.344723 0.583893
max b vales: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W vales: 0.759521 0.295101 0.34472 0.5839
max b vales: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
max b vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
Since none of my values is very high, the only way a NaN can happen is by a badly handled 0/0, but since this tutorial code doesn't do any divisions or similar operations, I see no other explanation than that this comes from the internal TF code.
I'm clueless on what to do with this. Any suggestions? The algorithm is converging nicely, its accuracy on my validation set was steadily climbing and just reached 92.5% at iteration 8600.
Recommended answer
Actually, it turned out to be something stupid. I'm posting this in case anyone else would run into a similar error.
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
is actually a horrible way of computing the cross-entropy. In some samples, certain classes could be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but in the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.
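To see the failure in isolation, here is a minimal sketch in NumPy rather than TensorFlow (an assumption, chosen so it runs without a session). The arrays `y_true` and `y_pred` are hypothetical stand-ins for `y_` and `y_conv`. It should show the same IEEE floating-point behaviour: log(0) is -inf, 0 * -inf is NaN, and a single NaN term poisons the whole sum.

```python
import numpy as np

# Hypothetical one-hot label and a prediction that has driven two classes to exactly 0.
y_true = np.array([1.0, 0.0, 0.0], dtype=np.float32)
y_pred = np.array([1.0, 0.0, 0.0], dtype=np.float32)

# Silence NumPy's warnings for log(0) and 0 * -inf so the values can be inspected.
with np.errstate(divide="ignore", invalid="ignore"):
    terms = y_true * np.log(y_pred)  # per-class terms: 0, then 0 * -inf twice
    naive = -np.sum(terms)           # same form as the cross_entropy line above

print(terms)
print(naive)
```

Note that the prediction here is perfectly correct, yet the loss is still NaN, because the classes with a zero label are multiplied by -inf instead of being skipped.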
Replacing the original cross_entropy line with
cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))
solved all my issues.
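An alternative to clipping, not part of the fix above, is to compute the loss from the pre-softmax scores (logits) using the log-sum-exp trick, so that log(0) is never evaluated. The sketch below uses plain Python in place of TensorFlow, and the function names are hypothetical. TensorFlow's `tf.nn.softmax_cross_entropy_with_logits` is the built-in op intended for this situation.

```python
import math

def log_softmax(logits):
    """Return log(softmax(logits)) without ever forming a probability of 0."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy_from_logits(labels, logits):
    """Cross-entropy between one-hot (or soft) labels and unnormalised scores."""
    return -sum(y * lp for y, lp in zip(labels, log_softmax(logits)))

# Extreme scores: a plain softmax would give probabilities of exactly 1, 0 and 0.
logits = [1000.0, 0.0, -1000.0]
loss = cross_entropy_from_logits([1.0, 0.0, 0.0], logits)
print(loss)
```

Because the log-probabilities stay finite (here 0, -1000 and -2000), multiplying them by a zero label gives 0 rather than NaN.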