Comprehensive Master Notes: CNN

The Architecture, Mathematics, and Flow of Convolutional Neural Networks

Anchor example: RGB image 224×224×3 → class label "cat"

Original paper: LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11).

1. Architectural Taxonomy

Modern CNNs are partitioned into structural archetypes depending on how spatial features are extracted, how depth is managed, and how computational cost is controlled.

Architecture type | Core structural idea | Why it was invented | Industry benchmarks
Sequential stack (LeNet, AlexNet, VGG) | Conv → Pool → Conv → Pool → FC | First proof that hierarchical learned features beat handcrafted features | MNIST, early ImageNet
Residual networks (ResNet, ResNeXt) | Skip connections bypass conv blocks: \(y = F(x) + x\) | Solves degradation in very deep networks via gradient shortcuts | ImageNet SOTA 2015+, production backbones
Depthwise separable (MobileNet, EfficientNet) | Factorize standard conv into spatial + pointwise (see §6) | Reduces FLOPs by 8–9× for mobile/edge deployment | On-device inference
Transformer-era designs (ConvNeXt, CoAtNet) | Adopt Transformer design choices in a pure conv net (ConvNeXt) or merge conv local priors with global attention (CoAtNet) | Compete with Vision Transformers while retaining inductive bias | Modern ImageNet, COCO

2. Data Representation: From Pixels to GPU Tensors

Raw image files (JPEG/PNG) are opaque binary blobs. They must be decoded and transformed into structured floating-point tensors before any mathematical operation can execute on hardware.

Step A: Decode & Normalize

decode: JPEG bytes → uint8 tensor of shape \((H, W, 3)\), values \([0, 255]\)
normalize: \(X = \dfrac{\text{pixel} - \mu}{\sigma}\) → float32, values \(\approx [-2.5, +2.5]\)
why this operation
Convolution kernels perform weighted sums. If raw values range \([0,255]\), gradient magnitudes become unstable and learning rates must be impractically small. Normalization centers the data around zero with unit variance, so weight-initialization assumptions hold and gradients flow stably from the first iteration.
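The normalization step can be sketched in plain Python as follows. The per-channel mean and standard deviation are the values commonly used with ImageNet-pretrained models and are an assumption here, as is the preliminary scaling of pixels to \([0, 1]\); the notes do not fix either choice.

```python
# Per-channel statistics commonly used with ImageNet-pretrained models.
# They are an assumption here; the notes do not specify mu and sigma.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Map one (R, G, B) uint8 pixel to normalized floats."""
    out = []
    for value, mu, sigma in zip(rgb_uint8, MEAN, STD):
        scaled = value / 255.0          # [0, 255] -> [0.0, 1.0]
        out.append((scaled - mu) / sigma)
    return out


darkest = normalize_pixel((0, 0, 0))
brightest = normalize_pixel((255, 255, 255))
lo, hi = min(darkest), max(brightest)   # extremes of the normalized range
```

Under these assumed statistics, `lo` and `hi` bracket the full normalized range, which lands close to the \(\approx [-2.5, +2.5]\) interval quoted above.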

Step B: Batching & Memory Layout

batch formation: stack \(B\) images → tensor \((B, C, H, W) = (32, 3, 224, 224)\)
GPU transfer: CPU RAM → PCIe/NVLink → GPU HBM (contiguous NCHW layout)
why this operation
GPU cores hit peak throughput only on large contiguous memory blocks processed in parallel. Batching amortizes kernel-launch overhead and enables vectorized matrix ops across many images at once — turning sequential pixel processing into massively parallel linear algebra.

This yields the finalized Input Tensor \(X\) of shape \((32, 3, 224, 224)\) — ready for the first convolution.
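To make "contiguous NCHW layout" concrete, here is a small sketch of how a 4-D index maps to a flat memory offset. The helper name `nchw_offset` is hypothetical and used only for illustration.

```python
def nchw_offset(b, c, h, w, C, H, W):
    """Flat index of element (b, c, h, w) in a contiguous NCHW buffer."""
    return ((b * C + c) * H + h) * W + w


B, C, H, W = 32, 3, 224, 224
total_elements = B * C * H * W
bytes_float32 = total_elements * 4   # 4 bytes per float32

# Neighbours along W are adjacent in memory; the next channel is H*W away.
step_w = nchw_offset(0, 0, 0, 1, C, H, W) - nchw_offset(0, 0, 0, 0, C, H, W)
step_c = nchw_offset(0, 1, 0, 0, C, H, W) - nchw_offset(0, 0, 0, 0, C, H, W)
```

The batch of 32 images holds 4,816,896 floats, roughly 19 MB in float32, and the width axis is the fastest-varying one.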

3. Deep Dive: The Convolution Operation

The convolution layer is the feature-extraction engine of CNNs. It slides small learned filters across the spatial extent of the input, computing local weighted sums at every position.

The Mathematical Formulation

Let \(X_{\text{pad}}\) be the input zero-padded by \(P\) on each spatial border. A single output element is:

output element: \(\displaystyle Y[b,k,i,j] = \beta_k + \sum_{c=0}^{C_{\text{in}}-1}\sum_{u=0}^{K-1}\sum_{v=0}^{K-1} W[k,c,u,v]\;\cdot\; X_{\text{pad}}[b,\,c,\,iS+u,\,jS+v]\)

where \(\beta_k\) is the bias for output channel \(k\). Making the padding explicit (indexing into \(X_{\text{pad}}\), not \(X\)) is what keeps \(i,j\) valid at the borders.

Symbol | Meaning | Typical value
\(K\) | kernel spatial size (height = width) | 3
\(C_{\text{in}}\) | number of input channels | 3 (RGB) or 64, 128…
\(C_{\text{out}}\) | number of output filters | 64, 128, 256…
\(S\) | stride (step of the sliding window) | 1 or 2
\(P\) | zero-padding on input borders | 1 (for "same" output)
[Figure: a 3×3 kernel slides over the 5×5 padded input X_pad; each position computes Σ W·X + β and writes one cell of the 3×3 output Y.]
The kernel covers a \(K{\times}K\) receptive patch, multiplies-and-sums into one output cell, then strides by \(S\). Same kernel reused at every position — that is weight sharing.
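The output-element formula translates almost literally into nested loops. The following is a minimal, unoptimized sketch for a single image (batch index dropped) using nested Python lists; the function name `conv2d_naive` is illustrative only.

```python
def conv2d_naive(x, w, bias, stride=1, pad=0):
    """x: [C_in][H][W], w: [C_out][C_in][K][K], bias: [C_out] (nested lists)."""
    c_in, h_in, w_in = len(x), len(x[0]), len(x[0][0])
    c_out, k = len(w), len(w[0][0])

    # Build X_pad explicitly so border indices stay valid.
    hp, wp = h_in + 2 * pad, w_in + 2 * pad
    x_pad = [[[0.0] * wp for _ in range(hp)] for _ in range(c_in)]
    for c in range(c_in):
        for i in range(h_in):
            for j in range(w_in):
                x_pad[c][i + pad][j + pad] = x[c][i][j]

    h_out = (hp - k) // stride + 1
    w_out = (wp - k) // stride + 1
    y = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for ko in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                acc = bias[ko]
                for c in range(c_in):
                    for u in range(k):
                        for v in range(k):
                            acc += w[ko][c][u][v] * x_pad[c][i * stride + u][j * stride + v]
                y[ko][i][j] = acc
    return y


# One 3x3 input channel, one 3x3 all-ones kernel, padding 1 ("same" output).
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
w = [[[[1.0] * 3 for _ in range(3)]]]
y = conv2d_naive(x, w, bias=[0.0], stride=1, pad=1)
```

With an all-ones kernel each output cell is the sum of the input values under the window, so the centre cell sums all nine inputs while the corner cells only see four non-padded values.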

Output Shape & Parameter Count

spatial output: \(H_{\text{out}} = \left\lfloor \dfrac{H_{\text{in}} + 2P - K}{S} \right\rfloor + 1\), full shape \((B, C_{\text{out}}, H_{\text{out}}, W_{\text{out}})\)
parameters: \((K^2 \cdot C_{\text{in}} \cdot C_{\text{out}}) + C_{\text{out}}\)

Numeric example: input \((32,3,224,224)\), kernel \(7{\times}7\), 64 filters, stride 2, padding 3:

output shape: \((32, 64, 112, 112)\)
params: \((7 \times 7 \times 3 \times 64) + 64 = 9408 + 64 = \mathbf{9472}\)
why this operation
A fully-connected layer over this input would need \((224{\cdot}224{\cdot}3) \times (112{\cdot}112{\cdot}64) \approx\) 121 billion parameters. Convolution exploits two priors, locality (nearby pixels matter most) and translation equivariance (the same edge detector works everywhere), and covers the same input with 9,472 shared weights: a reduction of more than 12 million×.
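The output-shape and parameter-count formulas above are easy to check mechanically. A minimal sketch, with hypothetical helper names:

```python
def conv_out_size(size_in, k, stride, pad):
    """Spatial output size: floor((size_in + 2*pad - k) / stride) + 1."""
    return (size_in + 2 * pad - k) // stride + 1


def conv_params(k, c_in, c_out, bias=True):
    """Weights K*K*C_in*C_out plus one bias per output channel."""
    return k * k * c_in * c_out + (c_out if bias else 0)


# The numeric example: 224x224 input, 7x7 kernel, 64 filters, stride 2, padding 3.
h_out = conv_out_size(224, k=7, stride=2, pad=3)
n_params = conv_params(k=7, c_in=3, c_out=64)

# A 3x3, stride-1, padding-1 conv keeps the spatial size unchanged.
same_3x3 = conv_out_size(56, k=3, stride=1, pad=1)
```

Integer floor division (`//`) plays the role of the floor bracket in the formula.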

Backward Pass: the Gradient of a Convolution is a Convolution

Given the upstream gradient \(\delta Y = \partial L/\partial Y\), the three gradients are as follows (batch index dropped, stride 1, no padding so that \(X_{\text{pad}} = X\); entries of \(\delta Y\) outside its bounds count as zero):

  1. \(\dfrac{\partial L}{\partial W[k,c,u,v]} = \displaystyle\sum_{i,j} \delta Y[k,i,j]\;X_{\text{pad}}[c,\,i+u,\,j+v]\)  — a correlation of the input with the output gradient.
  2. \(\dfrac{\partial L}{\partial X[c,p,q]} = \displaystyle\sum_{k}\sum_{u,v} \delta Y[k,\,p-u,\,q-v]\;W[k,c,u,v]\)  — a full convolution of \(\delta Y\) with the \(180^\circ\)-flipped kernel.
  3. \(\dfrac{\partial L}{\partial \beta_k} = \displaystyle\sum_{i,j}\delta Y[k,i,j]\)  — sum the output gradient over space.
why this matters
Both the forward pass and the backward pass are convolutions — the weight gradient is input ⋆ output-gradient, and the input gradient is a transposed (flipped-kernel) convolution. That is why the same highly-optimized im2col + GEMM kernels accelerate training as well as inference.
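The weight and input gradient formulas can be verified numerically. The sketch below restricts itself to one input channel and one output channel (stride 1, no padding) and compares the analytic gradients against central finite differences; all function names are illustrative.

```python
def corr2d_valid(x, w):
    """Forward pass: Y[i][j] = sum_{u,v} W[u][v] * X[i+u][j+v]."""
    k = len(w)
    n_h, n_w = len(x) - k + 1, len(x[0]) - k + 1
    return [[sum(w[u][v] * x[i + u][j + v] for u in range(k) for v in range(k))
             for j in range(n_w)] for i in range(n_h)]


def grad_w(x, dy, k):
    """dL/dW[u][v] = sum_{i,j} dY[i][j] * X[i+u][j+v]."""
    return [[sum(dy[i][j] * x[i + u][j + v]
                 for i in range(len(dy)) for j in range(len(dy[0])))
             for v in range(k)] for u in range(k)]


def grad_x(w, dy, h, wd):
    """dL/dX[p][q] = sum_{u,v} dY[p-u][q-v] * W[u][v], zero outside dY."""
    k = len(w)
    gx = [[0.0] * wd for _ in range(h)]
    for p in range(h):
        for q in range(wd):
            for u in range(k):
                for v in range(k):
                    i, j = p - u, q - v
                    if 0 <= i < len(dy) and 0 <= j < len(dy[0]):
                        gx[p][q] += dy[i][j] * w[u][v]
    return gx


def loss(x, w, dy):
    """Scalar L = sum(dY * Y), chosen so that dL/dY equals the fixed dY."""
    y = corr2d_valid(x, w)
    return sum(dy[i][j] * y[i][j] for i in range(len(y)) for j in range(len(y[0])))


x = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, 2.0],
     [2.0, 1.0, 0.0, 1.0], [1.0, 0.0, 2.0, 3.0]]
w = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5], [1.5, 0.0, 1.0]]
dy = [[1.0, -2.0], [0.5, 3.0]]

gw = grad_w(x, dy, k=3)
gx = grad_x(w, dy, h=4, wd=4)

eps = 1e-6


def numeric(target, r, c):
    """Central finite difference of the loss w.r.t. target[r][c]."""
    old = target[r][c]
    target[r][c] = old + eps
    up = loss(x, w, dy)
    target[r][c] = old - eps
    down = loss(x, w, dy)
    target[r][c] = old
    return (up - down) / (2 * eps)


max_err_w = max(abs(gw[u][v] - numeric(w, u, v)) for u in range(3) for v in range(3))
max_err_x = max(abs(gx[p][q] - numeric(x, p, q)) for p in range(4) for q in range(4))
```

Both error values should be at rounding-noise level, which is the usual sanity check before trusting a hand-written backward pass.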
hardware execution
A classic GPU implementation is im2col + GEMM: input patches are unrolled into matrix columns, kernel weights form rows, and a single large matrix multiply (for example via cuBLAS) replaces the nested loops of the definition. Libraries such as cuDNN also choose among implicit-GEMM, Winograd, and FFT algorithms depending on the layer shape.
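The im2col + GEMM idea can be illustrated without any GPU. In this sketch a plain Python `matmul` stands in for the optimized GEMM call, and the example omits bias and padding for brevity; it is meant to show the data rearrangement, not a production kernel.

```python
def im2col(x, k, stride=1):
    """x: [C][H][W] -> matrix with one column per output position.

    Each column holds the flattened C*K*K patch under the kernel.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    h_out, w_out = (h - k) // stride + 1, (w - k) // stride + 1
    patches = []
    for i in range(h_out):
        for j in range(w_out):
            patches.append([x[c][i * stride + u][j * stride + v]
                            for c in range(c_in) for u in range(k) for v in range(k)])
    # Transpose so that patches become columns: shape (C*K*K, H_out*W_out).
    return [list(row) for row in zip(*patches)], h_out, w_out


def matmul(a, b):
    """Plain matrix product; stands in for the optimized GEMM call."""
    b_cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in b_cols] for row in a]


def conv2d_im2col(x, w):
    """w: [C_out][C_in][K][K]; returns y: [C_out][H_out][W_out]."""
    k = len(w[0][0])
    cols, h_out, w_out = im2col(x, k)
    # Flatten each filter into one row: shape (C_out, C_in*K*K).
    w_rows = [[w_k[c][u][v] for c in range(len(w_k)) for u in range(k) for v in range(k)]
              for w_k in w]
    flat = matmul(w_rows, cols)                  # (C_out, H_out*W_out)
    return [[row[i * w_out:(i + 1) * w_out] for i in range(h_out)] for row in flat]


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
w = [[[[1.0, 0.0], [0.0, 1.0]]],      # filter 0: sums the main diagonal
     [[[1.0, 1.0], [1.0, 1.0]]]]      # filter 1: sums the whole 2x2 patch
y = conv2d_im2col(x, w)
```

The trade-off is memory: every input value is copied into up to \(K^2\) columns, which is one reason libraries also offer implicit variants that never materialize the full matrix.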

4. Nonlinearity: The Activation Function

After every convolution produces a linear output tensor, a pointwise nonlinear function is applied element-by-element.

ReLU (Rectified Linear Unit)

forward: \(\operatorname{ReLU}(x) = \max(0, x)\)
gradient: \(\dfrac{\partial\,\operatorname{ReLU}}{\partial x} = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}\)
tensor shape: unchanged, \((B, C, H, W) \to (B, C, H, W)\)
why this operation
Without a nonlinearity, stacking \(N\) conv layers collapses into a single linear map \(W_N \cdots W_1 X = W_{\text{eq}} X\) — the whole network would have the power of one layer. ReLU breaks this collapse with a decision boundary at zero, letting the network approximate complex nonlinear functions via composition of piecewise-linear segments.
why not sigmoid/tanh
Sigmoid/tanh saturate for large inputs (gradient → 0), causing vanishing gradients in deep stacks. ReLU has constant gradient 1 for positive inputs, enabling stable flow through 50+ layers. The trade-off — "dead neurons" (permanently zero) — is addressed by LeakyReLU/GELU.
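A rough way to see the saturation argument is to multiply local activation gradients along a deep chain. The sketch below deliberately ignores the weight matrices between activations, so it illustrates the trend rather than modelling a real network.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Convention from the notes: gradient is 0 at x == 0.
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


# Gradient surviving 20 stacked activations at pre-activation x = 2.0,
# ignoring the weight matrices in between (a deliberate simplification).
depth = 20
relu_chain = relu_grad(2.0) ** depth
sigmoid_chain = sigmoid_grad(2.0) ** depth
```

The sigmoid derivative never exceeds 0.25, so its product shrinks geometrically with depth, while the ReLU product stays at 1 for positive inputs.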

5. Normalization: Batch Normalization

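Activation distributions drift across layers and iterations (the original motivation, termed internal covariate shift). BatchNorm stabilizes them and is usually placed between the convolution and the ReLU, as in the Conv + BN + ReLU stem of §8. Statistics are computed per channel, pooled over the batch and both spatial axes.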

statistics: \(\mu_c = \dfrac{1}{B H W}\displaystyle\sum_{b,i,j} X[b,c,i,j], \qquad \sigma_c^2 = \dfrac{1}{B H W}\sum_{b,i,j}(X[b,c,i,j]-\mu_c)^2\)
normalize: \(\hat X[b,c,i,j] = \dfrac{X[b,c,i,j]-\mu_c}{\sqrt{\sigma_c^2 + \epsilon}}\)
scale & shift: \(Y[b,c,i,j] = \gamma_c\,\hat X[b,c,i,j] + \beta_c\)

\(\gamma_c\) (scale) and \(\beta_c\) (shift) are learned per channel — the network can undo the normalization if useful.

why this operation
When layer outputs have unpredictable mean/variance, the next layer's weights must constantly re-adapt to shifting inputs, slowing convergence. Forcing each channel to zero-mean/unit-variance (before the affine):
  • smooths the loss landscape, enabling 5–10× higher learning rates;
  • injects mini-batch noise that acts as implicit regularization;
  • reduces sensitivity to weight initialization.
train vs inference
Batch statistics only exist during training. BatchNorm keeps an exponential moving average of \(\mu_c, \sigma_c^2\) over training and uses those fixed running estimates at inference, so a single image produces deterministic outputs independent of its batch.
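The train and inference paths can be sketched side by side. The momentum value 0.1 and the use of the biased variance for the running estimate are assumptions made for this illustration; frameworks differ in both details.

```python
import math


def batchnorm_train(x, gamma, beta, running_mean, running_var, momentum=0.1, eps=1e-5):
    """x: [B][C][H][W] nested lists. Returns y and updates running stats in place."""
    b, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    n = b * h * w
    y = [[[[0.0] * w for _ in range(h)] for _ in range(c)] for _ in range(b)]
    for ch in range(c):
        vals = [x[bi][ch][i][j] for bi in range(b) for i in range(h) for j in range(w)]
        mu = sum(vals) / n
        var = sum((v - mu) ** 2 for v in vals) / n        # biased, as in the formula
        inv_std = 1.0 / math.sqrt(var + eps)
        for bi in range(b):
            for i in range(h):
                for j in range(w):
                    x_hat = (x[bi][ch][i][j] - mu) * inv_std
                    y[bi][ch][i][j] = gamma[ch] * x_hat + beta[ch]
        # Exponential moving average kept for inference.
        running_mean[ch] = (1 - momentum) * running_mean[ch] + momentum * mu
        running_var[ch] = (1 - momentum) * running_var[ch] + momentum * var
    return y


def batchnorm_eval(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference: fixed running statistics, so each image is handled independently."""
    return [[[[gamma[ch] * (v - running_mean[ch]) / math.sqrt(running_var[ch] + eps) + beta[ch]
               for v in row] for row in plane]
             for ch, plane in enumerate(img)] for img in x]


x = [[[[1.0, 3.0]]], [[[5.0, 7.0]]]]      # B=2, C=1, H=1, W=2
running_mean, running_var = [0.0], [1.0]
y = batchnorm_train(x, gamma=[1.0], beta=[0.0],
                    running_mean=running_mean, running_var=running_var)
y_eval = batchnorm_eval(x, [1.0], [0.0], running_mean, running_var)
```

In training mode the output of each channel has mean 0 and variance close to 1; in evaluation mode the result depends only on the stored running estimates, not on which other images share the batch.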

6. Spatial Downsampling & Efficiency

As features become more abstract, full spatial resolution becomes redundant and expensive. Downsampling shrinks the grid while growing the effective receptive field.

Max Pooling

operation: \(Y[b,c,i,j] = \displaystyle\max_{0\le u,v < K} X[b,\,c,\,iS+u,\,jS+v]\)
typical: \(K=2, S=2\) → halves each spatial dim
shape change: \((B, 256, 56, 56) \to (B, 256, 28, 28)\)
Example: input 4×4 with rows [1 3 8 2], [4 2 1 0], [5 6 7 3], [1 2 4 9] → max 2×2, stride 2 → output 2×2 with rows [4 8], [6 9].
Each \(2{\times}2\) block collapses to its maximum, which halves resolution, keeps the strongest activation, and grants small-shift invariance.
why downsample
  • Compute: conv FLOPs scale with \(H\cdot W\); halving each dim cuts compute ~4×.
  • Receptive field: after downsampling, each later neuron "sees" a larger region of the original image.
  • Robustness: max pooling gives local shift invariance — a feature at \((10,10)\) or \((11,11)\) yields the same pooled output.
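The pooling formula, applied to the 4×4 example values from this section, can be sketched for a single channel as:

```python
def maxpool2d(x, k=2, stride=2):
    """x: [H][W] single channel; returns the pooled [H_out][W_out] grid."""
    h_out = (len(x) - k) // stride + 1
    w_out = (len(x[0]) - k) // stride + 1
    return [[max(x[i * stride + u][j * stride + v] for u in range(k) for v in range(k))
             for j in range(w_out)] for i in range(h_out)]


x = [[1, 3, 8, 2],
     [4, 2, 1, 0],
     [5, 6, 7, 3],
     [1, 2, 4, 9]]
pooled = maxpool2d(x)
```

Moving the 9 from the bottom-right corner to any other cell of its 2×2 block leaves `pooled` unchanged, which is the local shift invariance described above.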

Receptive Field — the Recurrence

The receptive field is the input region influencing one output neuron. Track the cumulative stride ("jump") \(j_l\) and field \(r_l\) layer by layer:

jump: \(j_0 = 1, \qquad j_l = j_{l-1}\cdot S_l\)
receptive field: \(r_0 = 1, \qquad r_l = r_{l-1} + (K_l - 1)\,j_{l-1}\)
[Figure: receptive-field growth through stacked 3×3 convs, traced from one output neuron back to the input. Labels: input r = 7, r = 5.]
Two stacked \(3{\times}3\) convs (stride 1): \(r_1 = 1+(3{-}1)\cdot1 = 3\), \(r_2 = 3+(3{-}1)\cdot1 = 5\) … add a stride-2 layer and the field jumps faster. If \(r_l\) is smaller than the object, the net cannot "see" it.
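The recurrence is short enough to run by hand, but a small helper makes it easy to try other layer stacks. The function name and the `(kernel, stride)` tuple format are choices made for this sketch.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride). Returns (r, jump) after the last layer."""
    r, jump = 1, 1
    for k, s in layers:
        r = r + (k - 1) * jump   # uses the jump *before* this layer
        jump = jump * s
    return r, jump


two_convs = receptive_field([(3, 1), (3, 1)])         # two stacked 3x3, stride 1
stem = receptive_field([(7, 2)])                       # 7x7 conv, stride 2
stem_plus_pool = receptive_field([(7, 2), (3, 2)])     # followed by 3x3 pool, stride 2
```

Note that the order of the two updates matters: the field grows by the jump accumulated so far, and only then does the current layer's stride enlarge the jump for later layers.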

Depthwise-Separable Convolution (the MobileNet trick)

A standard conv mixes spatially and across channels at once. Separable conv factorizes these into two cheaper steps:

standard cost: \(K^2 \cdot C_{\text{in}} \cdot C_{\text{out}}\)
depthwise: one \(K{\times}K\) filter per input channel, \(K^2 \cdot C_{\text{in}}\)
pointwise (1×1): mix channels, \(C_{\text{in}} \cdot C_{\text{out}}\)
separable total: \(K^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}} = C_{\text{in}}(K^2 + C_{\text{out}})\)
cost ratio: \(\dfrac{K^2 C_{\text{in}} + C_{\text{in}}C_{\text{out}}}{K^2 C_{\text{in}} C_{\text{out}}} = \dfrac{1}{C_{\text{out}}} + \dfrac{1}{K^2}\)
why this is ~8–9× cheaper
For \(K=3\) and a large \(C_{\text{out}}\), the ratio \(\approx \tfrac{1}{9}\) — an order-of-magnitude FLOP and parameter reduction with little accuracy loss. Depthwise handles "where" (spatial), pointwise handles "what mix of channels" — separating the two responsibilities is what makes on-device CNNs feasible.
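Plugging concrete numbers into the cost formulas shows where the "8–9×" figure comes from. The channel sizes below are illustrative and not taken from the notes.

```python
def standard_cost(k, c_in, c_out):
    return k * k * c_in * c_out


def separable_cost(k, c_in, c_out):
    depthwise = k * k * c_in      # one KxK filter per input channel
    pointwise = c_in * c_out      # 1x1 conv mixing channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256   # illustrative sizes, not from the notes
ratio = separable_cost(k, c_in, c_out) / standard_cost(k, c_in, c_out)
closed_form = 1 / c_out + 1 / (k * k)
speedup = 1 / ratio
```

For these sizes the speedup sits between 8× and 9×; it approaches 9× only as \(C_{\text{out}}\) grows, because the \(1/C_{\text{out}}\) term never fully vanishes.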

7. Classifier Head: From Feature Maps to Decisions

After the final conv block, the network holds a feature tensor \((B, C_{\text{final}}, H_{\text{final}}, W_{\text{final}})\) that must collapse into a class prediction.

Global Average Pooling (GAP)

operation: \(z[b, c] = \dfrac{1}{H W}\displaystyle\sum_{i,j} X[b, c, i, j]\)
shape change: \((B, 2048, 7, 7) \to (B, 2048)\)
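Global average pooling is a per-channel mean over the spatial grid. A minimal sketch on nested lists, using a tiny tensor in place of the \((B, 2048, 7, 7)\) one:

```python
def global_average_pool(x):
    """x: [B][C][H][W] nested lists -> z: [B][C]."""
    return [[sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
             for plane in img] for img in x]


# One image, two channels, 2x2 spatial grid.
x = [[[[1.0, 2.0], [3.0, 4.0]],
      [[0.0, 0.0], [0.0, 8.0]]]]
z = global_average_pool(x)
```

The operation has no learnable parameters and works for any input resolution, since the spatial axes are simply averaged away.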

Linear Layer, Softmax & Cross-Entropy

logits: \(\text{logits} = z\,W_{\text{cls}} + b_{\text{cls}}, \quad (B, 2048)\times(2048, C) \to (B, C)\)
params (ImageNet): \(2048 \times 1000 + 1000 = \mathbf{2{,}049{,}000}\)
softmax: \(P(y{=}k\mid x) = \dfrac{e^{\text{logit}_k}}{\sum_j e^{\text{logit}_j}}\)
cross-entropy: \(L = -\log P(y_{\text{true}}\mid x) = -\,\text{logit}_{y_{\text{true}}} + \log\textstyle\sum_j e^{\text{logit}_j}\)
why GAP instead of flatten
Flattening \((2048, 7, 7)\) gives a 100,352-d vector; an FC layer to 1000 classes would need ~100 million parameters — heavy overfitting and memory. GAP averages each channel to a single scalar → a compact 2048-d vector, eliminating ~98% of head parameters while acting as structural regularization.
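The softmax and cross-entropy formulas of this section are usually evaluated through a shifted log-sum-exp so that large logits do not overflow. The sketch below shows that trick in plain Python; the logit values are hypothetical.

```python
import math


def log_sum_exp(logits):
    """Stable log(sum_j exp(logit_j)): subtract the max before exponentiating."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def softmax(logits):
    lse = log_sum_exp(logits)
    return [math.exp(v - lse) for v in logits]


def cross_entropy(logits, true_index):
    """L = -logit_true + log(sum_j exp(logit_j))."""
    return -logits[true_index] + log_sum_exp(logits)


logits = [2.0, 1.0, 0.1]          # hypothetical scores for 3 classes
probs = softmax(logits)
loss = cross_entropy(logits, true_index=0)

# Large logits would overflow a naive exp(); the shifted form stays finite.
big_loss = cross_entropy([1000.0, 999.0, 0.0], true_index=0)
```

Computing the loss directly from logits, rather than taking the log of the softmax output, is the second form of the cross-entropy equation above and avoids `log(0)` when a probability underflows.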

8. End-to-End Operational Lifecycle

[RGB Image: 224×224×3 pixels]
        │
        ▼
[Decode + Normalize + Batch] ──────────────► Tensor: (32, 3, 224, 224)
        │
        ▼
[Conv 7×7, 64 filters, stride 2 + BN + ReLU] ► (32, 64, 112, 112)
        │
        ▼
[MaxPool 3×3, stride 2] ──────────────────► (32, 64, 56, 56)
        │
        ▼
[Conv Block Stack: channels 64→128→256→512] ► (32, 512, 7, 7)
        │
        ▼
[Global Average Pooling] ─────────────────► (32, 512)
        │
        ▼
[Linear Layer: 512 → 1000 classes] ───────► (32, 1000) logits
        │
        ▼
[Softmax → Argmax] ───────────────────────► Prediction: "cat"

TRAINING PATH:
  logits + true_label → Cross-Entropy Loss → Backprop → Optimizer Step
Stage | Tensor shape | What it represents | Receptive field
Input | (B, 3, 224, 224) | Raw RGB pixel intensities | 1×1 px
After stem conv | (B, 64, 112, 112) | Low-level edges, color gradients | 7×7 px
After stage 2 | (B, 128, 56, 56) | Textures, corners, simple patterns | ~35×35 px
After stage 3 | (B, 256, 28, 28) | Object parts, repeated motifs | ~91×91 px
After stage 4 | (B, 512, 7, 7) | Semantic object-level features | ~224×224 px (global)
After GAP | (B, 512) | Image-level feature summary | Full image
Logits | (B, 1000) | Raw unnormalized class scores | Full image
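The spatial sizes in the pipeline can be traced with the output-size formula from §3. Two details are assumptions of this sketch because the diagram does not state them: the max pool uses padding 1, and the block stack reduces resolution through three 3×3, stride-2, padding-1 steps.

```python
def out_size(size_in, k, stride, pad):
    return (size_in + 2 * pad - k) // stride + 1


size = 224
trace = [("input", size)]

size = out_size(size, k=7, stride=2, pad=3)      # stem conv from the diagram
trace.append(("stem conv 7x7/2", size))

size = out_size(size, k=3, stride=2, pad=1)      # pad=1 is assumed, not stated
trace.append(("maxpool 3x3/2", size))

# Assumption: the block stack halves the resolution three times,
# each time with a 3x3, stride-2, pad-1 convolution.
for n in range(3):
    size = out_size(size, k=3, stride=2, pad=1)
    trace.append((f"downsample {n + 1}", size))

sizes = [s for _, s in trace]
unpadded_pool = out_size(112, k=3, stride=2, pad=0)
```

Without padding the 3×3, stride-2 pool would yield 55 rather than 56, which is why the padding assumption is needed to reproduce the shapes in the diagram.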

9. Interview Depth Q&A

Question | Strong answer pattern
Why CNN over MLP for images? | CNNs encode locality (nearby pixels correlate) and translation equivariance (same pattern anywhere), cutting parameter counts by orders of magnitude while improving generalization.
What is the receptive field and why does it matter? | The input region influencing one output neuron: \(r_l = r_{l-1} + (K_l-1)\,j_{l-1}\). It should cover the object of interest; otherwise no single neuron sees the whole object.
Purpose of a 1×1 convolution? | Channel-wise linear combination with no spatial mixing, i.e. a per-pixel FC layer. Used for channel reduction (bottlenecks), expansion, and the pointwise step of depthwise-separable conv.
Why does ResNet beat a deeper plain (VGG-style) stack? | Skip connections give \(\partial L/\partial x = \partial L/\partial y\,(1 + \partial F/\partial x)\). The additive \(1\) gives the gradient a direct path that survives even when \(F\)'s Jacobian is small, enabling 100+ layer training.
Is the backward pass of a conv also a conv? | Yes: \(\partial L/\partial W\) is input ⋆ output-gradient, and \(\partial L/\partial X\) is a full convolution of the output-gradient with the flipped kernel. Same GEMM kernels accelerate both directions.
How does depthwise-separable save compute? | It factorizes \(K^2 C_{\text{in}} C_{\text{out}}\) into \(K^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}\); ratio \(= \tfrac{1}{C_{\text{out}}} + \tfrac{1}{K^2} \approx \tfrac19\) for \(3{\times}3\) and large \(C_{\text{out}}\).
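The skip-connection gradient argument from the Q&A can be illustrated with a scalar toy model, where each layer's branch derivative \(\partial F/\partial x\) is a single small number. This is a simplification for intuition only; real networks have Jacobian matrices rather than scalars.

```python
def plain_chain_grad(local_grads):
    """d(out)/d(in) for stacked y = F(x) layers: product of F'(x)."""
    g = 1.0
    for d in local_grads:
        g *= d
    return g


def residual_chain_grad(local_grads):
    """d(out)/d(in) for stacked y = F(x) + x layers: product of (1 + F'(x))."""
    g = 1.0
    for d in local_grads:
        g *= 1.0 + d
    return g


# 50 scalar layers whose branch derivative F'(x) is a small 0.01 (illustrative).
local = [0.01] * 50
plain = plain_chain_grad(local)
residual = residual_chain_grad(local)
```

In the plain stack the gradient collapses toward zero, while the residual stack keeps it at order one because every factor contains the identity term.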