What Is Convolution?
Convolution is a mathematical operation widely used in signal processing and image analysis to extract features. In deep learning, convolution layers capture local patterns in the input data.
Core Concepts:
- Kernel / Filter: A small matrix that scans across the input.
- Receptive Field: The region covered by the kernel at each step.
- Parameter Sharing: The same kernel weights are applied across the whole input, greatly reducing the number of parameters (see the sketch below).
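To make the parameter-sharing claim concrete, here is a rough sketch (in PyTorch, consistent with the examples below; the layer sizes are illustrative) comparing the parameter count of a 3×3 convolution with a fully connected layer producing the same output:
import torch.nn as nn
# Conv2d: 16 output channels × (1 input channel × 3 × 3 weights) + 16 biases = 160
# parameters, independent of the input's height and width.
conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))  # 160
# A fully connected layer producing the same 16×28×28 output from a 28×28 input
# needs a separate weight for every input-output pixel pair.
fc = nn.Linear(28 * 28, 16 * 28 * 28)
print(sum(p.numel() for p in fc.parameters()))    # 9847040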
Convolution Formula:
For a 2D input $I$ and kernel $K$, the output at position $(i, j)$ is
$$S(i, j) = \sum_{m} \sum_{n} I(i + m,\ j + n)\, K(m, n)$$
(PyTorch, like most deep learning libraries, implements this as cross-correlation, i.e., without flipping the kernel.)
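As a minimal sketch of what this sum computes (plain loops, no padding or stride; the function name is illustrative):
import torch

def conv2d_naive(image, kernel):
    # Slide the kernel over every valid position and take the elementwise
    # product-and-sum with the patch underneath it (cross-correlation).
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = torch.zeros(oh, ow)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out

print(conv2d_naive(torch.randn(5, 5), torch.randn(3, 3)).shape)  # torch.Size([3, 3])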
Code Example:
import torch
import torch.nn as nn
# Define a simple 2D convolution
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1)
input_tensor = torch.randn(1, 1, 5, 5) # batch=1, channel=1, height=5, width=5
output = conv(input_tensor)
print("Convolution output shape:", output.shape)
- nn.Conv2d defines a 2D convolution layer.
  - in_channels=1: the input has 1 channel (e.g., a grayscale image).
  - out_channels=1: the output has 1 channel.
  - kernel_size=3: a 3×3 kernel.
  - stride=1: the kernel moves 1 pixel at a time.
  - padding=1: pads the input with one ring of zeros so the output size matches the input.
- input_tensor: a simulated input tensor of shape [batch_size, channels, height, width]; here, a single 5×5 one-channel feature map.
- Convolution Operation: conv(input_tensor) applies the convolution to produce an output feature map. The output size formula is
  $$O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$$
  where $I$ is the input size, $K$ the kernel size, $P$ the padding, and $S$ the stride. Substituting: $O = (5 - 3 + 2 \cdot 1)/1 + 1 = 5$, so the output is a feature map of shape [1, 1, 5, 5].
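A quick sketch to check the formula against a few stride/padding settings (the values are chosen for illustration):
for stride, padding in [(1, 1), (1, 0), (2, 1)]:
    conv = nn.Conv2d(1, 1, kernel_size=3, stride=stride, padding=padding)
    out = conv(torch.randn(1, 1, 5, 5))
    print(f"stride={stride}, padding={padding} -> {tuple(out.shape)}")
# stride=1, padding=1 -> (1, 1, 5, 5)   since (5 - 3 + 2)/1 + 1 = 5
# stride=1, padding=0 -> (1, 1, 3, 3)   since (5 - 3 + 0)/1 + 1 = 3
# stride=2, padding=1 -> (1, 1, 3, 3)   since floor((5 - 3 + 2)/2) + 1 = 3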
Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a deep learning model designed to process grid-structured data such as images. CNNs extract hierarchical feature representations through stacked convolution layers, activation layers, and pooling layers.
Main Components:
- Convolutional Layer: Extracts spatial features.
- Activation Function: introduces non-linearity (e.g., ReLU).
- Fully Connected Layer: Maps extracted features to classes.
Code Example:
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)  # 1 input channel to 16 feature maps
        self.relu = nn.ReLU()                                              # non-linear activation
        self.fc = nn.Linear(16 * 28 * 28, 10)                              # flattened features to 10 output classes

    def forward(self, x):
        x = self.conv1(x)            # convolution
        x = self.relu(x)             # non-linear activation
        x = x.view(x.size(0), -1)    # flatten to a 1D vector per sample
        x = self.fc(x)               # fully connected layer
        return x
- The input image is [28, 28] (single-channel), suitable for datasets like MNIST. Because padding=1 keeps the spatial size at 28×28, the flattened feature vector has 16 × 28 × 28 elements.
- The final output contains 10 class scores (see the usage sketch below).
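A quick usage sketch (the batch size of 4 is arbitrary):
model = SimpleCNN()
batch = torch.randn(4, 1, 28, 28)  # 4 single-channel 28×28 images
logits = model(batch)
print(logits.shape)                # torch.Size([4, 10]): one score per class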
Pooling, Downsampling, and Upsampling
Pooling Layer
Pooling reduces the spatial size of feature maps. Common types include Max Pooling and Average Pooling. Pooling reduces computation and makes features more robust to small spatial shifts.
Code Example:
pool = nn.MaxPool2d(kernel_size=2, stride=2)
input_tensor = torch.randn(1, 16, 28, 28)
output = pool(input_tensor)
print("After pooling:", output.shape) # torch.Size([1, 16, 14, 14])
- nn.MaxPool2d: defines a max pooling layer.
- Parameters:
  - kernel_size=2: window size 2×2.
  - stride=2: the window moves 2 pixels per step (non-overlapping), halving the height and width (a comparison with average pooling follows below).
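To contrast the two common pooling types, a small sketch on a 4×4 input (values chosen so the windows are easy to follow):
x = torch.arange(16.0).reshape(1, 1, 4, 4)
print(nn.MaxPool2d(2, 2)(x))  # largest value per 2×2 window:
                              # tensor([[[[ 5.,  7.],
                              #           [13., 15.]]]])
print(nn.AvgPool2d(2, 2)(x))  # mean of each 2×2 window:
                              # tensor([[[[ 2.5000,  4.5000],
                              #           [10.5000, 12.5000]]]])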
Downsampling and Upsampling
Downsampling uses pooling or resizing to reduce resolution.
Upsampling increases resolution and is often used in generative tasks or semantic segmentation.
Code Example:
import torch.nn.functional as F
# Downsampling
downsampled = F.interpolate(input_tensor, scale_factor=0.5, mode='bilinear', align_corners=False)
# Upsampling
upsampled = F.interpolate(input_tensor, scale_factor=2, mode='bilinear', align_corners=False)
print("After downsampling:", downsampled.shape)
print("After upsampling:", upsampled.shape)
- Downsampling:
  - scale_factor=0.5: shrinks the feature map by half (28×28 to 14×14 here).
  - mode='bilinear': smooth bilinear interpolation.
- Upsampling:
  - scale_factor=2: doubles the feature map size (28×28 to 56×56 here).
  - mode='bilinear': smooth enlargement.
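Note that downsampling discards detail: a down-then-up round trip does not reproduce the original. A small sketch, reusing input_tensor and downsampled from above:
roundtrip = F.interpolate(downsampled, scale_factor=2, mode='bilinear', align_corners=False)
print(roundtrip.shape)                          # back to torch.Size([1, 16, 28, 28])
print((roundtrip - input_tensor).abs().mean())  # non-zero: the discarded detail is gone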
BatchNorm
Batch Normalization stabilizes and accelerates training by normalizing the mean and variance of intermediate activations.
Major Benefits:
- Faster convergence.
- Reduced overfitting.
- Improved training stability.
Code Example:
x = torch.randn(1, 16, 7, 7)   # batch=1, 16 channels, 7×7 feature map
print("x.shape =", x.shape)
layer = nn.BatchNorm2d(num_features=16)
out = layer(x)
print("out.shape =", out.shape)
print("layer.weight.shape =", layer.weight.shape)  # per-channel scale (gamma)
print("layer.bias.shape =", layer.bias.shape)      # per-channel shift (beta)
print("vars(layer) =", vars(layer))                # inspect the layer's internal state
Output:
x.shape = torch.Size([1, 16, 7, 7])
out.shape = torch.Size([1, 16, 7, 7])
layer.weight.shape = torch.Size([16])
layer.bias.shape = torch.Size([16])
vars(layer) = {
    'training': True,
    '_parameters': OrderedDict([
        ('weight', Parameter containing:
            tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
                   requires_grad=True)),
        ('bias', Parameter containing:
            tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
                   requires_grad=True))]),
    '_buffers': OrderedDict([
        ('running_mean', tensor([ 0.0283,  0.0018, -0.0207, -0.0116, -0.0092, -0.0127,  0.0054, -0.0146,
                                 -0.0152, -0.0171, -0.0153,  0.0133,  0.0122, -0.0066,  0.0116,  0.0064])),
        ('running_var', tensor([0.9926, 0.9942, 0.9708, 0.9982, 1.0170, 0.9690, 1.0005, 1.0011, 0.9939,
                                0.9892, 0.9577, 0.9949, 1.0047, 0.9766, 0.9797, 1.0248])),
        ('num_batches_tracked', tensor(1))]),
    '_non_persistent_buffers_set': set(),
    '_backward_hooks': OrderedDict(),
    '_is_full_backward_hook': None,
    '_forward_hooks': OrderedDict(),
    '_forward_pre_hooks': OrderedDict(),
    '_state_dict_hooks': OrderedDict(),
    '_load_state_dict_pre_hooks': OrderedDict(),
    '_load_state_dict_post_hooks': OrderedDict(),
    '_modules': OrderedDict(),
    'num_features': 16,
    'eps': 1e-05,
    'momentum': 0.1,
    'affine': True,
    'track_running_stats': True
}
- nn.BatchNorm2d: defines batch normalization for 2D feature maps.
  - num_features=16: the number of channels; each channel has its own statistics and affine parameters.
- Input Tensor: an example feature map with batch size 1 and 16 channels.
- Batch Normalization Operation:
  - Normalizes each channel to zero mean and unit variance:
    $$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
  - Then applies a learnable scale and shift: $y = \gamma \hat{x} + \beta$, where $\gamma$ (layer.weight) and $\beta$ (layer.bias) are learned per channel.
  - This speeds up convergence and stabilizes training (see the sketch below).
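To make the operation concrete, a sketch that reproduces the layer's training-mode output by hand, reusing x, layer, and out from the example above:
# Per-channel mean and (biased) variance over the batch and spatial dimensions
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + layer.eps)   # normalize
manual = layer.weight.view(1, -1, 1, 1) * x_hat + layer.bias.view(1, -1, 1, 1)  # scale and shift
print(torch.allclose(manual, out, atol=1e-6))    # True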
