Problem Description
I understand that biases are required in small networks to shift the activation function. But in the case of a deep network with multiple CNN layers, pooling, dropout and other non-linear activations, does the bias really make a difference? A convolutional filter learns local features, and the same bias is used across a given conv output channel.
This is not a dupe of this link. That link only explains the role of bias in a small neural network and does not attempt to explain the role of bias in deep networks containing multiple CNN layers, dropout, pooling and non-linear activation functions.
I ran a simple experiment, and the results indicate that removing the bias from the conv layers made no difference in the final test accuracy. Two models were trained, and their test accuracy is almost identical (slightly better in the one without bias):
- model_with_bias
- model_without_bias (no bias added in the conv layers)
Are they being used only for historical reasons?
If using a bias does not improve accuracy, shouldn't we just omit it? There would be fewer parameters to learn.
I would be thankful if someone with deeper knowledge than me could explain the significance (if any) of these biases in deep networks.
Here is the complete code and the experiment result: bias-VS-no_bias experiment
# TensorFlow 1.x graph-mode code. It assumes the usual setup from earlier
# notebook cells: `import tensorflow as tf`, the dataset arrays (train_dataset,
# train_labels, valid_dataset, test_dataset, test_labels), plus image_size,
# num_channels, num_labels and the accuracy() helper are already defined.
batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

graph = tf.Graph()

with graph.as_default():
    # Input data.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size, image_size, num_channels))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables (shared by both models).
    layer1_weights = tf.Variable(tf.truncated_normal(
        [patch_size, patch_size, num_channels, depth], stddev=0.1))
    layer1_biases = tf.Variable(tf.zeros([depth]))
    layer2_weights = tf.Variable(tf.truncated_normal(
        [patch_size, patch_size, depth, depth], stddev=0.1))
    layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
    layer3_weights = tf.Variable(tf.truncated_normal(
        [image_size // 4 * image_size // 4 * depth, num_hidden], stddev=0.1))
    layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
    layer4_weights = tf.Variable(tf.truncated_normal(
        [num_hidden, num_labels], stddev=0.1))
    layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

    # Model with biases in every layer.
    def model_with_bias(data):
        conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv + layer1_biases)
        conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv + layer2_biases)
        shape = hidden.get_shape().as_list()
        reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
        hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
        return tf.matmul(hidden, layer4_weights) + layer4_biases

    # Model without biases in the convolutional layers.
    def model_without_bias(data):
        conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv)  # layer1_biases is not added
        conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv)  # layer2_biases is not added
        shape = hidden.get_shape().as_list()
        reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
        # Biases are added only in the fully connected layers (layer 3 and layer 4).
        hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
        return tf.matmul(hidden, layer4_weights) + layer4_biases

    # Training computation.
    logits_with_bias = model_with_bias(tf_train_dataset)
    loss_with_bias = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=tf_train_labels, logits=logits_with_bias))

    logits_without_bias = model_without_bias(tf_train_dataset)
    loss_without_bias = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=tf_train_labels, logits=logits_without_bias))

    # Optimizers.
    optimizer_with_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_with_bias)
    optimizer_without_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_without_bias)

    # Predictions for the training, validation, and test data (model with bias).
    train_prediction_with_bias = tf.nn.softmax(logits_with_bias)
    valid_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_valid_dataset))
    test_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_test_dataset))

    # Predictions for the model without bias.
    train_prediction_without_bias = tf.nn.softmax(logits_without_bias)
    valid_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_valid_dataset))
    test_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_test_dataset))

num_steps = 1001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        # Train both models on the same mini-batch.
        session.run(optimizer_with_bias, feed_dict=feed_dict)
        session.run(optimizer_without_bias, feed_dict=feed_dict)
    print('Test accuracy(with bias): %.1f%%' % accuracy(test_prediction_with_bias.eval(), test_labels))
    print('Test accuracy(without bias): %.1f%%' % accuracy(test_prediction_without_bias.eval(), test_labels))
Output:
Initialized
Test accuracy(with bias): 90.5%
Test accuracy(without bias): 90.6%
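For reference, here is a minimal sketch of the same comparison written with the tf.keras layer API rather than the raw graph code above (not the original notebook code; it assumes the same image_size, num_channels, num_labels and dataset arrays). Conv2D and Dense expose a use_bias flag, so the with-bias and without-bias variants differ by a single argument.

import tensorflow as tf

def make_model(conv_bias):
    # Two strided 5x5 conv layers, then the same fully connected head as above;
    # only the conv layers' use_bias flag changes between the two variants.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 5, strides=2, padding='same', activation='relu',
                               use_bias=conv_bias,
                               input_shape=(image_size, image_size, num_channels)),
        tf.keras.layers.Conv2D(16, 5, strides=2, padding='same', activation='relu',
                               use_bias=conv_bias),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),  # FC layers keep their biases
        tf.keras.layers.Dense(num_labels),
    ])

for conv_bias in (True, False):
    model = make_model(conv_bias)
    model.compile(optimizer=tf.keras.optimizers.SGD(0.05),
                  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    model.fit(train_dataset, train_labels, batch_size=16, epochs=1, verbose=0)
    loss, acc = model.evaluate(test_dataset, test_labels, verbose=0)
    print('conv bias = %s, test accuracy = %.1f%%' % (conv_bias, 100 * acc))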
Recommended Answer
In a large model, removing the bias inputs makes very little difference because each node can make a bias node out of the average activation of all of its inputs, which by the law of large numbers will be roughly normal. At the first layer, the ability for this to happen depends on your input distribution. On a small network you of course need a bias input, but on a large network, removing it makes almost no difference.
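A rough numerical illustration of that argument (a sketch with made-up numbers, not from the original answer): when a node's inputs are post-ReLU activations with a stable non-zero mean, the weights alone can produce any constant offset by acting on that mean, so an explicit bias parameter adds little.

import numpy as np

rng = np.random.default_rng(0)

# Pretend these are post-ReLU activations of a wide previous layer:
# non-negative, with a stable non-zero mean across batches.
x = np.maximum(rng.normal(0.5, 1.0, size=(10000, 1024)), 0.0)
w = rng.normal(scale=0.05, size=1024)
m = x.mean(axis=0)                  # the (roughly constant) mean input

# The weights can realise any desired offset b by shifting along the mean
# direction, i.e. the "bias" can be absorbed into w.
b = 0.7
w_shifted = w + m * (b / (m @ m))

print((x @ w).mean())               # average pre-activation without the shift
print((x @ w_shifted).mean())       # shifted up by ~b, with no bias parameter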
Although it makes little difference in a large network, it still depends on the network architecture. For instance, in LSTMs:
Most applications of LSTMs simply initialize the LSTMs with small random weights, which works well on many problems. But this initialization effectively sets the forget gate to 0.5. This introduces a vanishing gradient with a factor of 0.5 per timestep, which can cause problems whenever the long-term dependencies are particularly severe. The problem is addressed by simply initializing the forget gate bias to a large value such as 1 or 2. By doing so, the forget gate will be initialized to a value that is close to 1, enabling gradient flow.
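tf.keras bakes this particular fix into its LSTM layer through the unit_forget_bias argument (True by default), which adds 1 to the forget-gate bias at initialization; this is a case where the bias term itself carries architectural meaning. A minimal sketch:

import tensorflow as tf

# unit_forget_bias=True (the default) adds 1 to the forget-gate bias at
# initialization, the fix described in the quote above; False leaves the
# forget-gate bias at zero and reintroduces the vanishing-gradient issue.
lstm_forget_bias_one = tf.keras.layers.LSTM(64, unit_forget_bias=True)
lstm_forget_bias_zero = tf.keras.layers.LSTM(64, unit_forget_bias=False)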