A map of the major neural network families: what they are, what they are good at, and how to choose between them. Each type has its own notebook (01-10) with theory and runnable PyTorch examples.
Author: Benedict Thekkel
How to use this series
Each notebook is self-contained and follows the same shape: intuition, the core math, a from-scratch implementation, the idiomatic PyTorch version, a minimal training loop, and a strengths/weaknesses summary.
| # | Type | One-line role |
| --- | --- | --- |
| 01 | FNN / MLP | Dense mapping of fixed-size vectors |
| 02 | CNN | Spatial / grid data (images) |
| 03 | RNN | Sequences via a recurrent hidden state |
| 04 | LSTM | Long sequences with gated memory |
| 05 | GRU | Leaner gated recurrence |
| 06 | Autoencoder | Unsupervised compression / generation |
| 07 | GAN | Adversarial generative modeling |
| 08 | Transformer | Attention over sequences and patches |
| 09 | RBM | Energy-based stochastic feature learning |
| 10 | DBN | Stacked RBMs with greedy pretraining |
A mental model of the families
Three broad lineages cover almost everything below:
Feedforward (FNN, CNN, Transformer): one pass input to output, no recurrence. CNNs add spatial inductive bias; Transformers add attention.
Recurrent (RNN, LSTM, GRU): carry a hidden state across time steps. Largely superseded by Transformers for NLP, still useful for streaming / low-latency / small sequence tasks.
Generative / energy-based (Autoencoder, GAN, RBM, DBN): learn the data distribution rather than a label map. Modern generation leans on VAEs, GANs, and diffusion; RBM/DBN are mostly of historical importance.
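To make the feedforward versus recurrent split concrete, here is a minimal pure-Python sketch. It is not taken from the notebooks: the scalar weights `w`, `w_x`, `w_h` and the function names are illustrative assumptions, and a real layer would use weight matrices. The point is only that a feedforward map sees each input in isolation, while a recurrent update threads a hidden state through the sequence, so input order changes the result.

```python
import math

def feedforward(x, w=0.5):
    # Stateless: the output depends only on the current input.
    return math.tanh(w * x)

def recurrent(xs, w_x=0.5, w_h=0.9):
    # Stateful: h carries information forward from earlier time steps.
    h = 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
    return h

seq_a = [1.0, 0.0]
seq_b = [0.0, 1.0]  # same values, different order

# Per-step feedforward outputs are the same set of values in both cases.
ff_a = sorted(feedforward(x) for x in seq_a)
ff_b = sorted(feedforward(x) for x in seq_b)
print("feedforward outputs match up to order:", ff_a == ff_b)

# The final recurrent state differs because the order differs.
print("recurrent final states:", recurrent(seq_a), recurrent(seq_b))
```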
Choosing a model by data type
| Your data | Start with | Why |
| --- | --- | --- |
| Tabular / fixed vectors | FNN (MLP) | No spatial or temporal structure to exploit |
| Images / grids | CNN, or ViT (Transformer) | Local receptive fields and weight sharing |
| Text / tokens | Transformer | Parallel, long-range attention; LSTM/GRU if tiny or streaming |
| Time series | LSTM / GRU / Temporal CNN / Transformer | Depends on horizon and latency |
| Unlabeled, want features | Autoencoder / RBM | Reconstruction or energy-based pretraining |
| Want to synthesize samples | GAN / VAE / Diffusion | Learn and sample the data distribution |
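The "weight sharing" entry for images can be put in numbers. The sketch below is a back-of-the-envelope count under assumed sizes (a 3x32x32 input, 16 output channels, a 3x3 kernel); the helper names are hypothetical and the formulas are the usual weight-plus-bias counts for a dense layer and a 2D convolution.

```python
def dense_params(in_features, out_features):
    # One weight per (input, output) pair, plus one bias per output.
    return in_features * out_features + out_features

def conv2d_params(in_channels, out_channels, kernel_size):
    # One small kernel per (input channel, output channel) pair, reused at
    # every spatial position, plus one bias per output channel.
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

# Assumed sizes for illustration only.
c_in, c_out, height, width, k = 3, 16, 32, 32, 3

# A dense layer producing a feature map of the same shape as the conv output.
dense = dense_params(c_in * height * width, c_out * height * width)
conv = conv2d_params(c_in, c_out, k)

print(f"dense: {dense:,} parameters")
print(f"conv:  {conv:,} parameters")
```

The convolution's count does not depend on the image size at all, which is why it scales to large inputs where a dense layer does not.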
Parameter and compute intuition
import torch.nn as nn

def count_params(m):
    """Count the trainable parameters of a module."""
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

# Same input width (64), same hidden size (128): compare a few building blocks
mlp = nn.Linear(64, 128)                    # dense
rnn = nn.RNN(64, 128, batch_first=True)     # 1 weight block (no gates)
gru = nn.GRU(64, 128, batch_first=True)     # 3 weight blocks (2 gates + candidate)
lstm = nn.LSTM(64, 128, batch_first=True)   # 4 weight blocks (3 gates + candidate)

for name, m in [('Linear', mlp), ('RNN', rnn), ('GRU', gru), ('LSTM', lstm)]:
    print(f'{name:7} params = {count_params(m):,}')

# Ratio RNN:GRU:LSTM is 1:3:4 (the number of gate/candidate blocks drives parameter count)
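The 1:3:4 ratio can also be derived by hand. The sketch below assumes the parameter layout PyTorch uses for a single-layer, unidirectional recurrent module with biases enabled: per block, an input-to-hidden weight, a hidden-to-hidden weight, and two bias vectors. The function name `recurrent_params` is illustrative, not a library call.

```python
def recurrent_params(input_size, hidden_size, blocks):
    # Per block: W_ih (hidden x input) + W_hh (hidden x hidden) + two biases.
    per_block = (hidden_size * input_size
                 + hidden_size * hidden_size
                 + 2 * hidden_size)
    return blocks * per_block

def linear_params(in_features, out_features):
    # Weight matrix plus one bias vector.
    return in_features * out_features + out_features

d, h = 64, 128  # same widths as the comparison above
counts = {
    "Linear": linear_params(d, h),
    "RNN": recurrent_params(d, h, blocks=1),
    "GRU": recurrent_params(d, h, blocks=3),
    "LSTM": recurrent_params(d, h, blocks=4),
}
for name, n in counts.items():
    print(f"{name:7} params = {n:,}")
```

If the layout assumption holds, these closed-form numbers should agree with what `count_params` reports for the corresponding modules.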
Start simple. A well-tuned MLP or CNN is a strong baseline before reaching for attention.
Match the inductive bias to the data: convolution for locality, attention for global context, recurrence for strict ordering with small state.
For generation today, prefer VAE / GAN / diffusion over RBM / DBN; the latter two are best understood as the historical bridge (Hinton 2006) that revived deep learning.
Transformers dominate at scale, but self-attention costs O(n^2) time and memory in sequence length n; recurrent models process a sequence in O(n) time with a fixed-size state at inference, so they shine for long streaming inputs.
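The scaling difference in the last point is easy to see with a rough count. This sketch assumes a naive attention implementation that materializes the full n-by-n score matrix per head, and a recurrent model whose inference-time state is a single hidden vector; both helper names and the chosen sizes are illustrative, and memory-efficient attention kernels change the memory picture (though not the quadratic compute).

```python
def attention_score_entries(seq_len, heads=1):
    # Every position attends to every position: seq_len x seq_len per head.
    return heads * seq_len * seq_len

def recurrent_state_entries(hidden_size):
    # The carried state has a fixed size, independent of sequence length.
    return hidden_size

hidden = 128  # assumed hidden size
for n in (128, 1024, 8192):
    print(f"n={n:5d}  attention scores={attention_score_entries(n):>12,}  "
          f"recurrent state={recurrent_state_entries(hidden):,}")
```

Doubling the sequence length quadruples the attention score count while leaving the recurrent state unchanged.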
Next
Open the numbered notebooks in order, or jump straight to the family you need. The Model Design/ subfolder contains end-to-end build-from-scratch case studies (MNIST MLP, CNN, ResNet18).