Deep Learning Equations
1. The Basics: Neural Network as a Function
At its core, a neural network is a function \(f(\mathbf{x}; \boldsymbol{\theta})\), where:
- \(\mathbf{x}\) = input vector
- \(\boldsymbol{\theta}\) = model parameters (weights and biases)
It maps inputs to outputs by applying a series of linear transformations and non-linear activations.
2. Linear Transformation (Affine Function)
Each layer in a neural network computes: \[
\mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b}
\]
- \(\mathbf{W}\) = weight matrix
- \(\mathbf{b}\) = bias vector
- \(\mathbf{z}\) = pre-activation output
Example:
For a single neuron:
\[
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
\]
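A minimal NumPy sketch of this affine step (shapes and values here are illustrative, not from the text):

```python
import numpy as np

# Illustrative shapes: 3 inputs -> 2 outputs.
W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.2]])   # weight matrix, shape (2, 3)
b = np.array([0.1, -0.1])           # bias vector, shape (2,)
x = np.array([1.0, 2.0, 3.0])       # input vector, shape (3,)

z = W @ x + b                       # pre-activation output, shape (2,)
print(z)                            # [-0.4  0.6]
```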
3. Activation Functions
Activation functions introduce non-linearity.
Function | Equation | Purpose |
---|---|---|
Sigmoid | \(\sigma(z) = \dfrac{1}{1 + e^{-z}}\) | Outputs between \(0\) and \(1\) |
Tanh | \(\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\) | Outputs between \(-1\) and \(1\) |
ReLU | \(\text{ReLU}(z) = \max(0, z)\) | Introduces sparsity |
Leaky ReLU | \(\text{LeakyReLU}(z) = \max(\alpha z, z)\) | Avoids dying neurons |
Softmax | \(\text{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j} e^{z_j}}\) | Outputs probabilities |
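As a sketch, each of these fits in a few lines of NumPy (\(\alpha\) for Leaky ReLU is a hyperparameter, commonly around 0.01; subtracting the max inside softmax is a standard numerical-stability trick not shown in the table):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```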
4. Forward Propagation (The Core Computation)
For a deep neural network with \(L\) layers: \[
\mathbf{a}^{[0]} = \mathbf{x} \\
\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]} \\
\mathbf{a}^{[l]} = g^{[l]}(\mathbf{z}^{[l]}) \\
\text{For } l = 1, 2, \dots, L
\]
- \(\mathbf{a}^{[l]}\) = activation of layer \(l\)
- \(g^{[l]}\) = activation function for layer \(l\)
Final output:
\[ \hat{\mathbf{y}} = \mathbf{a}^{[L]} \]
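A sketch of this loop, assuming the parameters are stored as lists of matrices and vectors (the names are illustrative):

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Run forward propagation through L layers.

    weights[l], biases[l], activations[l] hold W^[l+1], b^[l+1], g^[l+1]
    (Python lists are 0-indexed; the math above is 1-indexed).
    """
    a = x                                   # a^[0] = x
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b                       # z^[l] = W^[l] a^[l-1] + b^[l]
        a = g(z)                            # a^[l] = g^[l](z^[l])
    return a                                # y_hat = a^[L]
```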
5. Loss Functions (How We Measure Error)
Task | Loss Function | Equation |
---|---|---|
Regression | Mean Squared Error (MSE) | \(\mathcal{L} = \dfrac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2\) |
Binary Classification | Binary Cross-Entropy (Log Loss) | \(\mathcal{L} = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]\) |
Multi-class Classification | Categorical Cross-Entropy | \(\mathcal{L} = - \dfrac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{C} y_c^{(i)} \log \hat{y}_c^{(i)}\) |
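Hedged NumPy sketches of these losses (the small `eps` guards against `log(0)`, a common implementation detail not in the equations above):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)    # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # y is one-hot; rows are examples, columns are classes.
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.mean(np.sum(y * np.log(y_hat), axis=1))
```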
6. Backpropagation (How We Learn)
Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer; gradient descent then uses those gradients to update the parameters.
6.1 Compute Gradients:
\[ \dfrac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \delta^{[l]} (\mathbf{a}^{[l-1]})^T \\ \dfrac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \delta^{[l]} \]
6.2 Backpropagate Errors:
\[
\delta^{[L]} = \nabla_{\hat{\mathbf{y}}} \mathcal{L} \odot g'^{[L]}(\mathbf{z}^{[L]}), \qquad
\delta^{[l]} = \left( (\mathbf{W}^{[l+1]})^T \delta^{[l+1]} \right) \odot g'^{[l]}(\mathbf{z}^{[l]})
\]
- \(\delta^{[l]}\) = error at layer \(l\); the recursion starts from the output-layer error \(\delta^{[L]}\)
- \(\odot\) = element-wise multiplication
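One way to sketch this backward pass for the network above (a minimal, illustrative version; it assumes the forward pass cached `zs` and `activs`, with `activs[0] = x`, and that `delta_L` is the output-layer error):

```python
import numpy as np

def backward(delta_L, weights, zs, activs, g_primes):
    """Return per-layer gradients (dWs, dbs) given the output error delta^[L].

    weights[l] holds W^[l+1]; zs[l] caches z^[l+1]; activs[l] caches a^[l];
    g_primes[l] is the derivative of layer l+1's activation function.
    """
    L = len(weights)
    dWs, dbs = [None] * L, [None] * L
    delta = delta_L
    for l in reversed(range(L)):
        dWs[l] = np.outer(delta, activs[l])   # dL/dW^[l] = delta^[l] (a^[l-1])^T
        dbs[l] = delta                        # dL/db^[l] = delta^[l]
        if l > 0:
            # delta^[l] = ((W^[l+1])^T delta^[l+1]) ⊙ g'(z^[l])
            delta = (weights[l].T @ delta) * g_primes[l - 1](zs[l - 1])
    return dWs, dbs
```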
7. Gradient Descent (Parameter Updates)
For each parameter \(\theta\) (weights or biases):
\[
\theta := \theta - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}
\]
- \(\alpha\) = learning rate
8. Optimizers (Variants of Gradient Descent)
Optimizer | Update Rule |
---|---|
SGD | \(\theta := \theta - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}\) |
Momentum | \(v := \beta v - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}\), then \(\theta := \theta + v\) |
Adam | Combines Momentum + RMSprop: \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\), \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\), \(\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}\), \(\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}\), \(\theta := \theta - \alpha \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\) |
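A sketch of these update rules (plain gradient descent from the previous section, momentum, and Adam) on a single parameter array; the hyperparameter defaults are common illustrative choices, not prescribed by the text:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    return theta - lr * grad

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    v = beta * v - lr * grad                     # velocity accumulates gradients
    return theta + v, v

def adam_step(theta, m, v, grad, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```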
9. Regularization (Prevent Overfitting)
- L2 Regularization (Weight Decay):
\[ \mathcal{L}_{\text{reg}} = \mathcal{L} + \dfrac{\lambda}{2m} \sum_{l} \|\mathbf{W}^{[l]}\|_F^2 \]
- Dropout:
\[ \mathbf{a}^{[l]} = \mathbf{a}^{[l]} \odot \mathbf{d}^{[l]} \]
- \(\mathbf{d}^{[l]}\) = dropout mask of 0s and 1s; each unit is kept with probability \(p\), and with inverted dropout the result is also scaled by \(1/p\) so expected activations match at test time
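A sketch of an inverted-dropout mask during training (the keep probability `p` is a hyperparameter; the `1/p` scaling is the standard inverted-dropout convention):

```python
import numpy as np

def dropout(a, p=0.8, training=True, rng=np.random.default_rng()):
    """Apply inverted dropout: keep each unit with probability p."""
    if not training:
        return a                          # no-op at test time
    d = (rng.random(a.shape) < p)         # dropout mask of 0s and 1s
    return (a * d) / p                    # scale so E[output] == input
```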
10. Convolution (CNNs)
\[
(\mathbf{I} * \mathbf{K})(i,j) = \sum_m \sum_n \mathbf{I}(i + m, j + n) \cdot \mathbf{K}(m, n)
\]
- \(\mathbf{I}\) = input image
- \(\mathbf{K}\) = kernel (filter)
As written (with \(i+m, j+n\) and no kernel flip), this is technically cross-correlation, which is what most deep learning libraries implement under the name "convolution".
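A naive sketch of this sum for a single-channel image and a small kernel (real libraries add padding, stride, and channels; this illustrative version assumes "valid" padding):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D cross-correlation (what DL libraries call 'convolution')."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # (I * K)(i, j) = sum_m sum_n I(i+m, j+n) K(m, n)
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out
```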
11. Recurrent Neural Networks (RNNs)
At each time step \(t\), an RNN updates a hidden state that summarizes the sequence so far:
\[
\mathbf{h}_t = g(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h) \\
\hat{\mathbf{y}}_t = \mathbf{W}_{hy} \mathbf{h}_t + \mathbf{b}_y
\]
- \(\mathbf{h}_t\) = hidden state at time \(t\)
- \(\mathbf{x}_t\) = input at time \(t\)
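A sketch of one recurrent step with this parameterization (names illustrative; tanh is the classic choice of \(g\)):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, W_hy, b_h, b_y):
    """One time step of a vanilla RNN with tanh activation."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # new hidden state
    y_t = W_hy @ h_t + b_y                           # output at time t
    return h_t, y_t
```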
12. Attention Mechanism (Transformers)
Self-Attention Equation:
\[
\text{Attention}(Q, K, V) = \text{softmax} \left( \dfrac{QK^T}{\sqrt{d_k}} \right) V
\]
- \(Q\) = query
- \(K\) = key
- \(V\) = value
- \(d_k\) = dimension of the key vectors; dividing by \(\sqrt{d_k}\) keeps the dot products from growing so large that the softmax saturates
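A sketch of scaled dot-product attention for a single head (here Q, K, V are matrices with one row per token; this illustrative version omits masking and multiple heads):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (tokens, tokens)
    # Row-wise softmax, stabilized by subtracting each row's max.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # weighted sum of values
```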
13. Transformer Feedforward Layer
\[ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \]
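The same layer as a two-line sketch (`W1`, `b1`, `W2`, `b2` are illustrative parameter names following the equation):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward: ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```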
14. Training Objective (Put it all together!)
Find \(\boldsymbol{\theta}\) that minimizes the loss over the training data:
\[ \boldsymbol{\theta}^* = \underset{\boldsymbol{\theta}}{\arg\min} \; \mathcal{L}(f(\mathbf{x}; \boldsymbol{\theta}), \mathbf{y}) \]
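Putting the pieces together, a minimal sketch of the training loop this objective implies (a one-layer linear model with MSE and plain gradient descent, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)   # noisy targets

theta = np.zeros(3)
alpha = 0.1                                    # learning rate
for step in range(200):
    y_hat = X @ theta                          # forward pass
    grad = 2 * X.T @ (y_hat - y) / len(y)      # dL/dtheta for MSE
    theta -= alpha * grad                      # gradient descent update
print(theta)                                   # approaches true_w
```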
Forward-Looking Concepts
Concept | Equation | Purpose |
---|---|---|
Contrastive Loss (CLIP) | \(\mathcal{L} = -\log \dfrac{\exp(\text{sim}(\mathbf{x}, \mathbf{y})/\tau)}{\sum_{i=1}^{N} \exp(\text{sim}(\mathbf{x}, \mathbf{y}_i)/\tau)}\) | Align different modalities (text, image) |
Diffusion Models | \(\mathbf{x}_{t-1} = \dfrac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \dfrac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}\) | Generate high-quality images |
Reinforcement Learning | \(Q(s,a) = r + \gamma \max_{a'} Q(s', a')\) | Optimize sequential decision-making |
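As one concrete example from the table, tabular Q-learning nudges \(Q(s,a)\) toward the Bellman target with a learning rate (a minimal sketch; states and actions are illustrative integer indices into a Q-table):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```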