Modern CNNs are partitioned into structural archetypes depending on how spatial features are extracted, how depth is managed, and how computational cost is controlled.
| Architecture type | Core structural idea | Why it was invented | Industry benchmarks |
| --- | --- | --- | --- |
| Sequential stack (LeNet, AlexNet, VGG) | Conv → Pool → Conv → Pool → FC | First proof that hierarchical learned features beat handcrafted features | |
| Residual (ResNet) | Skip connections that add a block's input to its output | Solves degradation in very deep networks via gradient shortcuts | ImageNet SOTA 2015+, production backbones |
| Depthwise separable (MobileNet, EfficientNet) | Factorize standard conv into spatial + pointwise (see §6) | Reduces FLOPs by 8–9× for mobile/edge deployment | On-device inference |
| Attention-augmented (ConvNeXt, CoAtNet) | Merge conv local priors with global attention | Compete with Vision Transformers while retaining inductive bias | Modern ImageNet, COCO |
2. Data Representation: From Pixels to GPU Tensors
Raw image files (JPEG/PNG) are opaque binary blobs. They must be decoded and transformed into structured floating-point tensors before any mathematical operation can execute on hardware.
Why this operation: Convolution kernels compute weighted sums. If raw values span \([0,255]\), activations and gradients scale with pixel magnitude, so gradient magnitudes become unstable and learning rates must be impractically small. Normalization centers each channel around zero with roughly unit variance, so weight-initialization assumptions hold and gradients flow stably from the first iteration.
Why this operation: GPU cores reach peak throughput only on large contiguous memory blocks processed in parallel. Batching amortizes kernel-launch overhead and enables vectorized matrix ops across many images at once, turning sequential pixel processing into massively parallel linear algebra.
This yields the finalized input tensor \(X\) of shape \((32, 3, 224, 224)\), laid out as (batch, channels, height, width) and ready for the first convolution.
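The decode, normalize, and batch steps can be sketched in NumPy. This is a minimal illustration rather than the pipeline the text has in mind: the images are random stand-ins for decoded JPEG/PNG data, the function name `to_input_tensor` is hypothetical, and the mean/std values are an assumption (the widely used ImageNet RGB statistics), since the text does not say which statistics it uses.

```python
import numpy as np

# Assumed per-channel statistics (commonly used ImageNet RGB mean/std).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_input_tensor(images_u8):
    """Convert decoded uint8 images (B, H, W, 3) to a float32 tensor (B, 3, H, W)."""
    x = images_u8.astype(np.float32) / 255.0   # scale [0, 255] -> [0, 1]
    x = (x - MEAN) / STD                       # per-channel normalization (broadcast over last axis)
    x = np.transpose(x, (0, 3, 1, 2))          # channels-last -> channels-first
    return np.ascontiguousarray(x)             # one contiguous block, convenient for a device copy

# Stand-in for 32 decoded 224x224 RGB images (random pixels, not real data).
rng = np.random.default_rng(0)
batch_u8 = rng.integers(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)

X = to_input_tensor(batch_u8)
print(X.shape, X.dtype)   # (32, 3, 224, 224) float32
```

In a real pipeline the random array would be replaced by the output of an image decoder, and the contiguous array would then be copied to the GPU.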
3. Deep Dive: The Convolution Operation
The convolution layer is the feature-extraction engine of CNNs. It slides small learned filters across the spatial extent of the input, computing local weighted sums at every position.
The Mathematical Formulation
Let \(X_{\text{pad}}\) be the input zero-padded by \(P\) on each spatial border. A single output element is:
\[ Y[k,i,j] = \beta_k + \sum_{c=0}^{C_{\text{in}}-1}\;\sum_{u=0}^{K-1}\;\sum_{v=0}^{K-1} W[k,c,u,v]\;X_{\text{pad}}[c,\,iS+u,\,jS+v] \]
where \(\beta_k\) is the bias for output channel \(k\). Indexing into \(X_{\text{pad}}\) rather than \(X\) makes the padding explicit and keeps \(i,j\) valid at the borders.
| Symbol | Meaning | Typical value |
| --- | --- | --- |
| \(K\) | kernel spatial size (height = width) | 3 |
| \(C_{\text{in}}\) | number of input channels | 3 (RGB) or 64, 128… |
| \(C_{\text{out}}\) | number of output filters | 64, 128, 256… |
| \(S\) | stride (step of the sliding window) | 1 or 2 |
| \(P\) | zero-padding on input borders | 1 (for "same" output) |
The kernel covers a \(K{\times}K\) receptive patch, multiplies and sums it into one output cell, then strides by \(S\). The same kernel is reused at every position; that is weight sharing.
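The sliding-window computation can be written out directly. The sketch below is a deliberately naive NumPy loop for a single image, meant only to make the indexing concrete; the name `conv2d_forward` is hypothetical, and like most deep-learning libraries it computes a cross-correlation (no kernel flip).

```python
import numpy as np

def conv2d_forward(X, W, b, stride=1, pad=0):
    """Naive convolution for one image.
    X: (C_in, H, W_in), W: (C_out, C_in, K, K), b: (C_out,)."""
    C_out, C_in, K, _ = W.shape
    X_pad = np.pad(X, ((0, 0), (pad, pad), (pad, pad)))   # zero-pad spatial borders only
    H_out = (X.shape[1] + 2 * pad - K) // stride + 1
    W_out = (X.shape[2] + 2 * pad - K) // stride + 1
    Y = np.zeros((C_out, H_out, W_out), dtype=X.dtype)
    for i in range(H_out):
        for j in range(W_out):
            # K x K patch across all input channels at output position (i, j)
            patch = X_pad[:, i * stride:i * stride + K, j * stride:j * stride + K]
            # The same weights W are reused at every (i, j): weight sharing.
            Y[:, i, j] = np.tensordot(W, patch, axes=([1, 2, 3], [0, 1, 2])) + b
    return Y

X = np.ones((3, 8, 8))
W = np.ones((2, 3, 3, 3))
b = np.zeros(2)

Y = conv2d_forward(X, W, b, stride=1, pad=1)
print(Y.shape)      # (2, 8, 8): "same" output size with K=3, P=1, S=1
print(Y[0, 4, 4])   # 27.0: interior patch sees 3*3*3 ones
print(Y[0, 0, 0])   # 12.0: corner patch overlaps only 2*2*3 real pixels

Y2 = conv2d_forward(X, W, b, stride=2, pad=1)
print(Y2.shape)     # (2, 4, 4): stride 2 halves each spatial dimension
```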
Why this operation: A fully-connected layer mapping this input to the same output volume would need \((224{\cdot}224{\cdot}3) \times (112{\cdot}112{\cdot}64) \approx 1.2\times10^{11}\) (about 121 billion) weights. Convolution exploits two priors, locality (nearby pixels matter most) and translation equivariance (the same edge detector works everywhere), and covers the same input with 9,472 shared parameters (\(7{\cdot}7{\cdot}3{\cdot}64\) weights plus 64 biases for a \(7{\times}7\) stem), a reduction of over 12 million×.
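The parameter comparison is plain arithmetic and can be checked in a few lines. The sketch assumes the convolution is a \(7{\times}7\) kernel mapping 3 input channels to 64 filters with one bias each, which is the configuration that yields exactly 9,472 parameters.

```python
# Dense layer from the flattened 224x224x3 input to a 112x112x64 output volume.
fc_inputs = 224 * 224 * 3        # 150528
fc_outputs = 112 * 112 * 64      # 802816
fc_weights = fc_inputs * fc_outputs

# Convolution: assumed 7x7 kernel, 3 -> 64 channels, one bias per filter.
conv_params = 7 * 7 * 3 * 64 + 64

print(fc_weights)                  # 120846286848
print(conv_params)                 # 9472
print(fc_weights // conv_params)   # 12758265
```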
Backward Pass: the Gradient of a Convolution is a Convolution
Given the upstream gradient \(\delta Y = \partial L/\partial Y\), the three gradients are (shown for stride 1 and no padding, so \(X_{\text{pad}} = X\); terms whose \(\delta Y\) indices fall out of range contribute zero):
\(\dfrac{\partial L}{\partial W[k,c,u,v]} = \displaystyle\sum_{i,j} \delta Y[k,i,j]\;X_{\text{pad}}[c,\,i+u,\,j+v]\) — a correlation of the input with the output gradient.
\(\dfrac{\partial L}{\partial X[c,p,q]} = \displaystyle\sum_{k}\sum_{u,v} \delta Y[k,\,p-u,\,q-v]\;W[k,c,u,v]\) — a full convolution of \(\delta Y\) with the \(180^\circ\)-flipped kernel.
\(\dfrac{\partial L}{\partial \beta_k} = \displaystyle\sum_{i,j}\delta Y[k,i,j]\) — sum the output gradient over space.
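A finite-difference check is a quick way to convince yourself of these three formulas. The sketch below assumes stride 1 and no padding; the function names are hypothetical, and the loss \(L = \sum Y \odot G\) with a random \(G\) is chosen only so that \(\delta Y = G\).

```python
import numpy as np

def conv_valid(X, W, b):
    """Stride-1, no-padding convolution. X: (C, H, Wd), W: (Kout, C, K, K)."""
    Kout, C, K, _ = W.shape
    Ho, Wo = X.shape[1] - K + 1, X.shape[2] - K + 1
    Y = np.zeros((Kout, Ho, Wo))
    for u in range(K):
        for v in range(K):
            # Y[k,i,j] += sum_c W[k,c,u,v] * X[c,i+u,j+v]
            Y += np.einsum('kc,cij->kij', W[:, :, u, v], X[:, u:u + Ho, v:v + Wo])
    return Y + b[:, None, None]

def conv_valid_backward(X, W, dY):
    """Analytic gradients following the three formulas above."""
    K = W.shape[2]
    Ho, Wo = dY.shape[1:]
    dW, dX = np.zeros_like(W), np.zeros_like(X)
    for u in range(K):
        for v in range(K):
            # dW[k,c,u,v] = sum_ij dY[k,i,j] * X[c,i+u,j+v]
            dW[:, :, u, v] = np.einsum('kij,cij->kc', dY, X[:, u:u + Ho, v:v + Wo])
            # dX[c,i+u,j+v] += sum_k dY[k,i,j] * W[k,c,u,v]  (the p-u, q-v form, re-indexed)
            dX[:, u:u + Ho, v:v + Wo] += np.einsum('kij,kc->cij', dY, W[:, :, u, v])
    db = dY.sum(axis=(1, 2))
    return dW, dX, db

def numeric_grad(f, A, eps=1e-6):
    """Central finite differences of scalar f() with respect to every entry of A."""
    g = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        hi = f()
        A[idx] = old - eps
        lo = f()
        A[idx] = old
        g[idx] = (hi - lo) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5, 5))
W = rng.standard_normal((3, 2, 3, 3))
b = rng.standard_normal(3)
G = rng.standard_normal((3, 3, 3))   # plays the role of dL/dY

def loss():
    return float(np.sum(conv_valid(X, W, b) * G))

dW, dX, db = conv_valid_backward(X, W, G)
ok_W = np.allclose(dW, numeric_grad(loss, W), atol=1e-5)
ok_X = np.allclose(dX, numeric_grad(loss, X), atol=1e-5)
ok_b = np.allclose(db, numeric_grad(loss, b), atol=1e-5)
print(ok_W, ok_X, ok_b)
```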
Why this matters: Both the forward pass and the backward pass are convolutions: the weight gradient is input ⋆ output-gradient, and the input gradient is a transposed (flipped-kernel) convolution. That is why the same highly optimized im2col + GEMM kernels accelerate training as well as inference.
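Hardware execution: On GPU, a common way to implement convolution is im2col + GEMM: input patches are unrolled into matrix columns, kernel weights form rows, and a BLAS routine such as cuBLAS executes one large matrix multiply. This converts the nested-loop definition into a hardware-optimized parallel op. Libraries such as cuDNN also select among other algorithms (for example Winograd or FFT-based variants) depending on the layer shape.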
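The im2col + GEMM idea can be illustrated on the CPU with NumPy. This is a sketch of the data layout only, not of how any GPU library actually schedules the work; `im2col` and `conv2d_gemm` are hypothetical names, and the patch-unrolling loop is left unoptimized for readability.

```python
import numpy as np

def im2col(X, K, stride=1, pad=0):
    """Unroll every KxK patch of X (C, H, W) into one column.
    Returns an array of shape (C*K*K, H_out*W_out) plus the output size."""
    C, H, W = X.shape
    Xp = np.pad(X, ((0, 0), (pad, pad), (pad, pad)))
    H_out = (H + 2 * pad - K) // stride + 1
    W_out = (W + 2 * pad - K) // stride + 1
    cols = np.empty((C * K * K, H_out * W_out), dtype=X.dtype)
    for i in range(H_out):
        for j in range(W_out):
            patch = Xp[:, i * stride:i * stride + K, j * stride:j * stride + K]
            cols[:, i * W_out + j] = patch.ravel()   # (c, u, v) order matches W.reshape below
    return cols, H_out, W_out

def conv2d_gemm(X, W, b, stride=1, pad=0):
    """Convolution expressed as a single matrix multiply."""
    C_out, C_in, K, _ = W.shape
    cols, H_out, W_out = im2col(X, K, stride, pad)
    Y = W.reshape(C_out, -1) @ cols + b[:, None]     # (C_out, C_in*K*K) @ (C_in*K*K, H_out*W_out)
    return Y.reshape(C_out, H_out, W_out)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6, 6))
W = rng.standard_normal((4, 3, 3, 3))
b = rng.standard_normal(4)
Y = conv2d_gemm(X, W, b, stride=1, pad=1)

# Spot-check one output element against the nested-sum definition.
Xp = np.pad(X, ((0, 0), (1, 1), (1, 1)))
k, i, j = 2, 3, 5
direct = b[k] + np.sum(W[k] * Xp[:, i:i + 3, j:j + 3])
print(Y.shape, np.isclose(Y[k, i, j], direct))   # (4, 6, 6) True
```

The trade-off is memory: each input pixel is copied into up to \(K^2\) columns, which is one reason production libraries also offer implicit-GEMM variants that avoid materializing the unrolled matrix.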
4. Nonlinearity: The Activation Function
After every convolution produces a linear output tensor, a pointwise nonlinear function is applied element-by-element.
Why this operation: Without a nonlinearity, stacking \(N\) conv layers collapses into a single linear map \(W_N \cdots W_1 X = W_{\text{eq}} X\), so the whole network would have the power of one layer. ReLU breaks this collapse with a decision boundary at zero, letting the network approximate complex nonlinear functions via composition of piecewise-linear segments.
Why not sigmoid/tanh: Sigmoid and tanh saturate for large inputs (gradient → 0), causing vanishing gradients in deep stacks. ReLU has constant gradient 1 for positive inputs, enabling stable flow through 50+ layers. The trade-off, "dead neurons" that stay permanently at zero, is addressed by variants such as LeakyReLU and GELU.
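Both claims are easy to verify numerically. The sketch below uses small dense matrices as stand-ins for conv layers (a convolution is also a linear map, so the collapse argument is the same); all sizes and the test input `z = 10.0` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

# Without a nonlinearity, two layers equal one layer with W_eq = W2 @ W1.
W_eq = W2 @ W1
collapsed = np.allclose(W2 @ (W1 @ x), W_eq @ x)
print(collapsed)   # True

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# ReLU is not additive, so a ReLU network cannot be folded into one matrix.
print(relu(1.0 + (-1.0)), relu(1.0) + relu(-1.0))   # 0.0 1.0

# Local gradients at a large positive pre-activation.
z = 10.0
relu_grad = 1.0 if z > 0 else 0.0
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))
print(relu_grad, sigmoid_grad)   # 1.0 and roughly 4.5e-05
```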
5. Normalization: Batch Normalization
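After Conv + ReLU, activation distributions drift across layers and iterations (internal covariate shift). BatchNorm stabilizes them. Statistics are computed per channel, pooled over the batch and both spatial axes:
\[ \mu_c = \frac{1}{BHW}\sum_{b,i,j} x[b,c,i,j], \qquad \sigma_c^2 = \frac{1}{BHW}\sum_{b,i,j}\bigl(x[b,c,i,j]-\mu_c\bigr)^2 \]
\[ \hat{x}[b,c,i,j] = \frac{x[b,c,i,j]-\mu_c}{\sqrt{\sigma_c^2+\epsilon}}, \qquad y[b,c,i,j] = \gamma_c\,\hat{x}[b,c,i,j] + \beta_c \]
Here \(\epsilon\) is a small constant for numerical stability. \(\gamma_c\) (scale) and \(\beta_c\) (shift) are learned per channel, so the network can undo the normalization if useful.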
Why this operation: When layer outputs have unpredictable mean/variance, the next layer's weights must constantly re-adapt to shifting inputs, slowing convergence. Forcing each channel to zero mean and unit variance (before the affine transform):
smooths the loss landscape, typically allowing higher learning rates (on the order of 5–10×);
injects mini-batch noise that acts as implicit regularization;
reduces sensitivity to weight initialization.
Train vs inference: Batch statistics are only available during training. BatchNorm keeps an exponential moving average of \(\mu_c, \sigma_c^2\) over training and uses those fixed running estimates at inference, so a single image produces deterministic outputs independent of whatever else is in its batch.
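The train/inference split can be sketched as a small NumPy class. This is a simplified illustration, not a drop-in replacement for any framework's layer: the class name, the `momentum=0.1` and `eps=1e-5` defaults, and the use of the biased variance for the running estimate are all assumptions made for the example.

```python
import numpy as np

class BatchNorm2d:
    """Minimal BatchNorm over (B, C, H, W) tensors; forward pass only."""

    def __init__(self, C, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(C)          # learned scale
        self.beta = np.zeros(C)          # learned shift
        self.running_mean = np.zeros(C)
        self.running_var = np.ones(C)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            # Per-channel statistics pooled over batch and both spatial axes.
            mu = x.mean(axis=(0, 2, 3))
            var = x.var(axis=(0, 2, 3))
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Fixed running estimates: output no longer depends on the batch.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu[None, :, None, None]) / np.sqrt(var[None, :, None, None] + self.eps)
        return self.gamma[None, :, None, None] * x_hat + self.beta[None, :, None, None]

rng = np.random.default_rng(0)
bn = BatchNorm2d(4)
x = rng.standard_normal((8, 4, 5, 5)) * 3.0 + 2.0

y = bn(x, training=True)
print(np.allclose(y.mean(axis=(0, 2, 3)), 0.0, atol=1e-6))   # True: zero mean per channel
print(np.allclose(y.var(axis=(0, 2, 3)), 1.0, atol=1e-3))    # True: unit variance per channel

single = bn(x[:1], training=False)
in_batch = bn(x, training=False)[:1]
print(np.allclose(single, in_batch))   # True: same image, same output, with or without batch mates
```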
6. Spatial Downsampling & Efficiency
As features become more abstract, full spatial resolution becomes redundant and expensive. Downsampling shrinks the grid while growing the effective receptive field.
In \(2{\times}2\) max pooling with stride 2, each \(2{\times}2\) block collapses to its maximum, which halves the resolution, keeps the strongest activation, and grants small-shift invariance.
Why downsample:
Compute: conv FLOPs scale with \(H\cdot W\); halving each dim cuts compute ~4×.
Receptive field: after downsampling, each later neuron "sees" a larger region of the original image.
Robustness: max pooling gives local shift invariance — a feature at \((10,10)\) or \((11,11)\) yields the same pooled output.
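A compact NumPy sketch of \(2{\times}2\) max pooling makes the shift-invariance claim concrete, including its limit: the invariance only holds while the feature stays inside the same pooling window. The function name `max_pool_2x2` and the 16×16 map size are choices made for this example, and coordinates are zero-indexed.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on (C, H, W); H and W are assumed even."""
    C, H, W = x.shape
    # Split each spatial axis into (block index, offset within block), then reduce the offsets.
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

a = np.zeros((1, 16, 16))
a[0, 10, 10] = 1.0
b = np.zeros((1, 16, 16))
b[0, 11, 11] = 1.0
c = np.zeros((1, 16, 16))
c[0, 12, 12] = 1.0

print(max_pool_2x2(a).shape)                               # (1, 8, 8)
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))    # True: same 2x2 window
print(np.array_equal(max_pool_2x2(b), max_pool_2x2(c)))    # False: crossed into the next window
```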
Receptive Field — the Recurrence
The receptive field is the input region influencing one output neuron. Track the cumulative stride ("jump") \(j_l\) and field \(r_l\) layer by layer, starting from \(r_0 = 1\), \(j_0 = 1\):
\[ r_l = r_{l-1} + (K_l - 1)\,j_{l-1}, \qquad j_l = j_{l-1}\,S_l \]
Two stacked \(3{\times}3\) convs (stride 1): \(r_1 = 1+(3{-}1)\cdot1 = 3\), \(r_2 = 3+(3{-}1)\cdot1 = 5\). Once a stride-2 layer doubles the jump, every later layer grows the field twice as fast. If \(r_l\) is smaller than the object, the net cannot "see" the whole object at once.
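The recurrence \(r_l = r_{l-1} + (K_l-1)\,j_{l-1}\), with the jump updated as \(j_l = j_{l-1} S_l\), is short enough to run by hand or in code. The helper below is a hypothetical sketch that takes a list of (kernel size, stride) pairs and ignores dilation.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns [(r_l, j_l), ...] after each layer."""
    r, j = 1, 1                   # a single input pixel, unit jump
    history = []
    for k, s in layers:
        r = r + (k - 1) * j       # uses the jump from *before* this layer
        j = j * s                 # stride compounds the jump for later layers
        history.append((r, j))
    return history

print(receptive_field([(3, 1), (3, 1)]))           # [(3, 1), (5, 1)]
print(receptive_field([(3, 2), (3, 1), (3, 1)]))   # [(3, 2), (7, 2), (11, 2)]
```

The second call shows the effect of an early stride-2 layer: the same two trailing \(3{\times}3\) convs now add 4 pixels each instead of 2.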
Depthwise-Separable Convolution (the MobileNet trick)
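A standard conv mixes spatially and across channels at once. Separable conv factorizes these into two cheaper steps:
Depthwise: one \(K{\times}K\) filter per input channel, costing \(K^2 C_{\text{in}}\) multiply-accumulates per output position.
Pointwise: a \(1{\times}1\) conv that mixes channels, costing \(C_{\text{in}} C_{\text{out}}\) per output position.
\[ \frac{K^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}}{K^2 C_{\text{in}} C_{\text{out}}} = \frac{1}{C_{\text{out}}} + \frac{1}{K^2} \]
Why this is ~8–9× cheaper: For \(K=3\) and a large \(C_{\text{out}}\), the ratio approaches \(\tfrac{1}{9}\), close to an order-of-magnitude FLOP and parameter reduction with little accuracy loss. Depthwise handles "where" (spatial), pointwise handles "what mix of channels"; separating the two responsibilities is what makes on-device CNNs feasible.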
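The saving can be checked by counting multiply-accumulates for one layer. The layer sizes below (128 → 256 channels on a 56×56 map) are arbitrary illustrative values, and `conv_costs` is a hypothetical helper.

```python
def conv_costs(K, C_in, C_out, H, W):
    """Multiply-accumulate counts for one layer producing an H x W output map."""
    standard = K * K * C_in * C_out * H * W
    depthwise = K * K * C_in * H * W     # one KxK filter per input channel
    pointwise = C_in * C_out * H * W     # 1x1 conv mixes channels
    return standard, depthwise + pointwise

std, sep = conv_costs(K=3, C_in=128, C_out=256, H=56, W=56)
print(std / sep)          # about 8.69: separable conv is ~8.7x cheaper here
print(sep / std)          # equals 1/C_out + 1/K^2
print(1 / 256 + 1 / 9)
```

Note that the ratio does not depend on \(C_{\text{in}}\) or the spatial size; only \(K\) and \(C_{\text{out}}\) matter.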
7. Classifier Head: From Feature Maps to Decisions
After the final conv block, the network holds a feature tensor \((B, C_{\text{final}}, H_{\text{final}}, W_{\text{final}})\) that must collapse into a class prediction.
Why GAP instead of flatten: Flattening \((2048, 7, 7)\) gives a 100,352-d vector; an FC layer to 1000 classes would need ~100 million parameters, which invites overfitting and costs memory. GAP averages each channel to a single scalar, giving a compact 2048-d vector whose FC layer needs only about 2 million parameters. That eliminates ~98% of head parameters while acting as structural regularization.
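The head-size comparison can be reproduced directly, along with the pooling operation itself. The batch size of 4 and the random feature tensor are placeholders for illustration.

```python
import numpy as np

C, H, W, num_classes = 2048, 7, 7, 1000

flatten_params = C * H * W * num_classes + num_classes   # FC on the flattened map (weights + biases)
gap_params = C * num_classes + num_classes               # FC on the pooled vector
print(C * H * W)                          # 100352
print(flatten_params, gap_params)         # 100353000 2049000
print(1 - gap_params / flatten_params)    # about 0.98

# Global average pooling: one scalar per channel.
features = np.random.default_rng(0).standard_normal((4, C, H, W))
pooled = features.mean(axis=(2, 3))
print(pooled.shape)                       # (4, 2048)
```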
TRAINING PATH:
logits + true_label → Cross-Entropy Loss → Backprop → Optimizer Step
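The first two arrows of the training path can be sketched with a numerically stable softmax cross-entropy and its gradient with respect to the logits. This is a simplified stand-in: the update at the end is applied directly to the logits with a hypothetical learning rate of 1.0, whereas a real optimizer step would update the network weights that produced them.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy and its gradient w.r.t. logits.
    logits: (B, num_classes) floats, labels: (B,) integer class indices."""
    B = logits.shape[0]
    shifted = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(B), labels].mean()
    grad = np.exp(log_probs)                # softmax probabilities
    grad[np.arange(B), labels] -= 1.0       # softmax minus one-hot target
    return loss, grad / B

logits = np.zeros((2, 1000))                # untrained network: all classes equally likely
labels = np.array([3, 7])
loss, grad = cross_entropy(logits, labels)
print(loss)                                 # log(1000), about 6.9078

lr = 1.0                                    # hypothetical step size
new_loss, _ = cross_entropy(logits - lr * grad, labels)
print(new_loss < loss)                      # True: one descent step lowers the loss
```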
| Stage | Tensor shape | What it represents | Receptive field |
| --- | --- | --- | --- |
| Input | (B, 3, 224, 224) | Raw RGB pixel intensities | 1×1 px |
| After stem conv | (B, 64, 112, 112) | Low-level edges, color gradients | 7×7 px |
| After stage 2 | (B, 128, 56, 56) | Textures, corners, simple patterns | ~35×35 px |
| After stage 3 | (B, 256, 28, 28) | Object parts, repeated motifs | ~91×91 px |
| After stage 4 | (B, 512, 7, 7) | Semantic object-level features | ~224×224 px (global) |
| After GAP | (B, 512) | Image-level feature summary | Full image |
| Logits | (B, 1000) | Raw unnormalized class scores | n/a |
9. Interview Depth Q&A
| Question | Strong answer pattern |
| --- | --- |
| Why CNN over MLP for images? | CNNs encode locality (nearby pixels correlate) and translation equivariance (same pattern anywhere), cutting parameters by millions while improving generalization. |
| What is the receptive field and why does it matter? | The input region influencing one output neuron: \(r_l = r_{l-1} + (K_l-1)\,j_{l-1}\). It must exceed the object size or the network cannot recognize it. |
| Purpose of a 1×1 convolution? | Channel-wise linear combination with no spatial mixing: effectively a per-pixel FC layer. Used for channel reduction (bottlenecks), expansion, and the pointwise step of depthwise-separable conv. |
| Why does ResNet beat deeper VGG? | Skip connections give \(\partial L/\partial x = \partial L/\partial y\,(1 + \partial F/\partial x)\). The additive \(1\) prevents vanishing regardless of \(F\)'s Jacobian, enabling 100+ layer training. |
| Is the backward pass of a conv also a conv? | Yes: \(\partial L/\partial W\) is input ⋆ output-gradient, and \(\partial L/\partial X\) is a full convolution of the output-gradient with the flipped kernel. Same GEMM kernels accelerate both directions. |
| How does depthwise-separable save compute? | It factorizes \(K^2 C_{\text{in}} C_{\text{out}}\) into \(K^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}\); ratio \(= \tfrac{1}{C_{\text{out}}} + \tfrac{1}{K^2} \approx \tfrac19\) for \(3{\times}3\). |