I am reading a lot of tutorials that state two things.
- "[Replacing fully connected layers with convolutional layers] casts them into fully convolutional networks that take input of any size and output classification maps." Fully Convolutional Networks for Semantic Segmentation, Shelhamer et al.
- A traditional CNN can't do this because it has fully connected layers whose shape is decided by the input image size.
Based on these statements, my questions are the following:
- Whenever I've made an FCN, I could only get it to work with a fixed dimension of input images for both training and testing. But in the paper's abstract, they note: "Our key insight is to build 'fully convolutional' networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning." How is this possible? The first layer has a fixed number of weights, and an input image of a different size would not properly link to these weights.
- How exactly does the input image size determine the fully connected layer? I tried looking online, but couldn't find a direct answer.
It seems like you are confusing the spatial dimensions (height and width) of an image/feature map with the "channel dimension", which is the dimension of the information stored per pixel.
An input image can have arbitrary height and width, but it always has a fixed "channel" dimension of 3; that is, each pixel has a fixed dimension of 3, namely the RGB values of its color.
Let's denote the input shape as 3xHxW (3 RGB channels, by height H, by width W).
Applying a convolution with kernel_size=5 and output_channels=64 means that you have 64 filters of size 3x5x5. For each filter you take all overlapping 3x5x5 windows in the image (RGB by 5 by 5 pixels) and output a single number per filter, which is a weighted sum of the input RGB values. Doing so for all 64 filters gives you 64 channels per sliding window, or an output feature map of shape 64x(H-4)x(W-4).
An additional convolution layer with, say, kernel_size=3 and output_channels=128 will have 128 filters of shape 64x3x3 applied to all 3x3 sliding windows in the input feature map of shape 64x(H-4)x(W-4), resulting in an output feature map of shape 128x(H-6)x(W-6).
You can continue in a similar way with additional convolution and even pooling layers.
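If it helps to see this shape arithmetic run, here is a minimal PyTorch sketch (my own illustration, using the same two layers as above on two example input sizes):

```python
import torch
import torch.nn as nn

# The two convolutions described above: 3 -> 64 channels with a 5x5 kernel,
# then 64 -> 128 channels with a 3x3 kernel (no padding, stride 1).
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5)
conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)

for h, w in [(224, 224), (300, 180)]:       # arbitrary spatial sizes
    x = torch.randn(1, 3, h, w)             # a batch of one 3xHxW image
    y = conv2(conv1(x))
    print(tuple(x.shape), '->', tuple(y.shape))
# (1, 3, 224, 224) -> (1, 128, 218, 218)
# (1, 3, 300, 180) -> (1, 128, 294, 174)
```

The spatial dimensions of the output change with the input, but the channel dimension (128) is always the same.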
This post has a very good explanation of how convolution/pooling layers affect the shapes of the feature maps.
To recap, as long as you do not change the number of input channels, you can apply a fully convolutional net to images of arbitrary spatial dimensions, resulting in output feature maps with different spatial shapes, but always with the same number of channels.
As for a fully connected (a.k.a. inner-product/linear) layer: this layer does not distinguish between spatial dimensions and channel dimensions. The input to a fully connected layer is "flattened", and then the number of weights is determined by the number of input elements (channel and spatial combined) and the number of outputs.
For instance, in a VGG network, when training on 3x224x224 images, the last convolution layer outputs a feature map of shape 512x7x7, which is then flattened to a 25,088-dimensional vector and fed into a fully connected layer with 4,096 outputs.
If you were to feed VGG input images of different spatial dimensions, say 3x256x256, your last convolution layer would output a feature map of shape 512x8x8 -- note how the channel dimension, 512, did not change, but the spatial dimensions grew from 7x7 to 8x8. Now, if you were to "flatten" this feature map, you would have a 32,768-dimensional input vector for your fully connected layer, but alas, your fully connected layer expects a 25,088-dimensional input: you will get a RuntimeError.
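To make the mismatch concrete, here is a small sketch of just the flatten-plus-first-fully-connected step (the 512x7x7 and 512x8x8 feature maps stand in for the VGG outputs discussed above):

```python
import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 4096)        # expects a 25,088-dimensional input

ok = torch.randn(1, 512, 7, 7)           # feature map from a 3x224x224 image
print(fc(ok.flatten(1)).shape)           # torch.Size([1, 4096])

bad = torch.randn(1, 512, 8, 8)          # feature map from a 3x256x256 image
fc(bad.flatten(1))                       # RuntimeError: 32,768 inputs, but the layer expects 25,088
```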
If you were to convert your fully connected layer to a convolutional layer with kernel_size=7 and output_channels=4096, it would do exactly the same mathematical operation on the 512x7x7 input feature map, producing a 4096x1x1 output feature map.
However, when you feed it a 512x8x8 feature map, it will not produce an error, but rather output a 4096x2x2 feature map: spatial dimensions adjusted, number of channels fixed.
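Here is the same idea as a sketch: the fully connected layer re-expressed as a 7x7 convolution accepts both feature-map sizes, and only the spatial dimensions of the output change:

```python
import torch
import torch.nn as nn

# 4096 filters of shape 512x7x7: the "convolutionalized" fully connected layer.
fc_as_conv = nn.Conv2d(in_channels=512, out_channels=4096, kernel_size=7)

print(fc_as_conv(torch.randn(1, 512, 7, 7)).shape)   # torch.Size([1, 4096, 1, 1])
print(fc_as_conv(torch.randn(1, 512, 8, 8)).shape)   # torch.Size([1, 4096, 2, 2])
```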