Deep Learning Equations
1. The Basics: Neural Network as a Function
At its core, a neural network is a function \(f(\mathbf{x}; \boldsymbol{\theta})\), where:
- \(\mathbf{x}\) = input vector
- \(\boldsymbol{\theta}\) = model parameters (weights and biases)
It maps inputs to outputs by applying a series of linear transformations and non-linear activations.
2. Linear Transformation (Affine Function)
Each layer in a neural network computes: \[
\mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b}
\]
- \(\mathbf{W}\) = weight matrix
- \(\mathbf{b}\) = bias vector
- \(\mathbf{z}\) = pre-activation output
Example:
For a single neuron:
\[
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
\]
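A minimal NumPy sketch of this affine step (shapes and values here are illustrative, not from the text):

```python
import numpy as np

# Illustrative shapes: 3 inputs -> 2 outputs.
W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.2]])   # weight matrix, shape (2, 3)
b = np.array([0.1, -0.1])           # bias vector, shape (2,)
x = np.array([1.0, 2.0, 3.0])       # input vector, shape (3,)

z = W @ x + b                       # pre-activation output, shape (2,)
print(z)                            # [-0.4  0.6]
```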
3. Activation Functions
Activation functions introduce non-linearity.
Function | Equation | Purpose |
---|---|---|
Sigmoid | \(\sigma(z) = \dfrac{1}{1 + e^{-z}}\) | Outputs between \(0\) and \(1\) |
Tanh | \(\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\) | Outputs between \(-1\) and \(1\) |
ReLU | \(\text{ReLU}(z) = \max(0, z)\) | Introduces sparsity |
Leaky ReLU | \(\text{LeakyReLU}(z) = \max(\alpha z, z)\) | Avoids dying neurons |
Softmax | \(\text{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j} e^{z_j}}\) | Outputs probabilities |
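As a sketch, each of these fits in a few lines of NumPy (\(\alpha\) for Leaky ReLU is a hyperparameter, commonly around 0.01; subtracting the max inside softmax is a standard numerical-stability trick not shown in the table):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```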
4. Forward Propagation (The Core Computation)
For a deep neural network with \(L\) layers: \[
\mathbf{a}^{[0]} = \mathbf{x} \\
\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]} \\
\mathbf{a}^{[l]} = g^{[l]}(\mathbf{z}^{[l]}) \\
\text{For } l = 1, 2, \dots, L
\]
- \(\mathbf{a}^{[l]}\) = activation of layer \(l\)
- \(g^{[l]}\) = activation function for layer \(l\)
Final output:
\[ \hat{\mathbf{y}} = \mathbf{a}^{[L]} \]
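A sketch of this loop, assuming the parameters are stored as lists of matrices and vectors (the names are illustrative):

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Run forward propagation through L layers.

    weights[l], biases[l], activations[l] hold W^[l+1], b^[l+1], g^[l+1]
    (Python lists are 0-indexed; the math above is 1-indexed).
    """
    a = x                                   # a^[0] = x
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b                       # z^[l] = W^[l] a^[l-1] + b^[l]
        a = g(z)                            # a^[l] = g^[l](z^[l])
    return a                                # y_hat = a^[L]
```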
5. Loss Functions (How We Measure Error)
Task | Loss Function | Equation |
---|---|---|
Regression | Mean Squared Error (MSE) | \(\mathcal{L} = \dfrac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2\) |
Binary Classification | Binary Cross-Entropy (Log Loss) | \(\mathcal{L} = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]\) |
Multi-class Classification | Categorical Cross-Entropy | \(\mathcal{L} = - \dfrac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{C} y_c^{(i)} \log \hat{y}_c^{(i)}\) |
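Hedged NumPy sketches of these losses (the small `eps` guards against `log(0)`, a common implementation detail not in the equations above):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)    # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # y is one-hot; rows are examples, columns are classes.
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.mean(np.sum(y * np.log(y_hat), axis=1))
```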
6. Backpropagation (How We Learn)
Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer; gradient descent then uses those gradients to update the parameters.
6.1 Compute Gradients:
\[ \dfrac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \delta^{[l]} (\mathbf{a}^{[l-1]})^T \\ \dfrac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \delta^{[l]} \]
6.2 Backpropagate Errors:
\[
\delta^{[L]} = \nabla_{\hat{\mathbf{y}}} \mathcal{L} \odot g'^{[L]}(\mathbf{z}^{[L]}), \qquad
\delta^{[l]} = \left( (\mathbf{W}^{[l+1]})^T \delta^{[l+1]} \right) \odot g'^{[l]}(\mathbf{z}^{[l]})
\]
- \(\delta^{[l]}\) = error at layer \(l\); the recursion starts from the output-layer error \(\delta^{[L]}\)
- \(\odot\) = element-wise multiplication
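One way to sketch this backward pass for the network above (a minimal, illustrative version; it assumes the forward pass cached `zs` and `activs`, with `activs[0] = x`, and that `delta_L` is the output-layer error):

```python
import numpy as np

def backward(delta_L, weights, zs, activs, g_primes):
    """Return per-layer gradients (dWs, dbs) given the output error delta^[L].

    weights[l] holds W^[l+1]; zs[l] caches z^[l+1]; activs[l] caches a^[l];
    g_primes[l] is the derivative of layer l+1's activation function.
    """
    L = len(weights)
    dWs, dbs = [None] * L, [None] * L
    delta = delta_L
    for l in reversed(range(L)):
        dWs[l] = np.outer(delta, activs[l])   # dL/dW^[l] = delta^[l] (a^[l-1])^T
        dbs[l] = delta                        # dL/db^[l] = delta^[l]
        if l > 0:
            # delta^[l] = ((W^[l+1])^T delta^[l+1]) ⊙ g'(z^[l])
            delta = (weights[l].T @ delta) * g_primes[l - 1](zs[l - 1])
    return dWs, dbs
```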
7. Gradient Descent (Parameter Updates)
For each parameter \(\theta\) (weights or biases):
\[
\theta := \theta - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}
\]
- \(\alpha\) = learning rate
8. Optimizers (Variants of Gradient Descent)
Optimizer | Update Rule |
---|---|
SGD | \(\theta := \theta - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}\) |
Momentum | \(v := \beta v - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}\), then \(\theta := \theta + v\) |
Adam | Combines Momentum + RMSprop: \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\), \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\), \(\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}\), \(\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}\), \(\theta := \theta - \alpha \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\) |
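A sketch of these update rules (plain gradient descent from the previous section, momentum, and Adam) on a single parameter array; the hyperparameter defaults are common illustrative choices, not prescribed by the text:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    return theta - lr * grad

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    v = beta * v - lr * grad                     # velocity accumulates gradients
    return theta + v, v

def adam_step(theta, m, v, grad, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```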
9. Regularization (Prevent Overfitting)
- L2 Regularization (Weight Decay):
\[ \mathcal{L}_{\text{reg}} = \mathcal{L} + \dfrac{\lambda}{2m} \sum_{l} \|\mathbf{W}^{[l]}\|_F^2 \]
- Dropout:
\[ \mathbf{a}^{[l]} = \mathbf{a}^{[l]} \odot \mathbf{d}^{[l]} \]
- \(\mathbf{d}^{[l]}\) = dropout mask of 0s and 1s; each unit is kept with probability \(p\), and with inverted dropout the result is also scaled by \(1/p\) so expected activations match at test time
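A sketch of an inverted-dropout mask during training (the keep probability `p` is a hyperparameter; the `1/p` scaling is the standard inverted-dropout convention):

```python
import numpy as np

def dropout(a, p=0.8, training=True, rng=np.random.default_rng()):
    """Apply inverted dropout: keep each unit with probability p."""
    if not training:
        return a                          # no-op at test time
    d = (rng.random(a.shape) < p)         # dropout mask of 0s and 1s
    return (a * d) / p                    # scale so E[output] == input
```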
10. Convolution (CNNs)
\[
(\mathbf{I} * \mathbf{K})(i,j) = \sum_m \sum_n \mathbf{I}(i + m, j + n) \cdot \mathbf{K}(m, n)
\]
- \(\mathbf{I}\) = input image
- \(\mathbf{K}\) = kernel (filter)
As written (with \(i+m, j+n\) and no kernel flip), this is technically cross-correlation, which is what most deep learning libraries implement under the name "convolution".
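A naive sketch of this sum for a single-channel image and a small kernel (real libraries add padding, stride, and channels; this illustrative version assumes "valid" padding):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D cross-correlation (what DL libraries call 'convolution')."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # (I * K)(i, j) = sum_m sum_n I(i+m, j+n) K(m, n)
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out
```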
11. Recurrent Neural Networks (RNNs)
At each time step \(t\), an RNN updates a hidden state that summarizes the sequence so far:
\[
\mathbf{h}_t = g(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h) \\
\hat{\mathbf{y}}_t = \mathbf{W}_{hy} \mathbf{h}_t + \mathbf{b}_y
\]
- \(\mathbf{h}_t\) = hidden state at time \(t\)
- \(\mathbf{x}_t\) = input at time \(t\)
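A sketch of one recurrent step with this parameterization (names illustrative; tanh is the classic choice of \(g\)):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, W_hy, b_h, b_y):
    """One time step of a vanilla RNN with tanh activation."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # new hidden state
    y_t = W_hy @ h_t + b_y                           # output at time t
    return h_t, y_t
```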
12. Attention Mechanism (Transformers)
Self-Attention Equation:
\[
\text{Attention}(Q, K, V) = \text{softmax} \left( \dfrac{QK^T}{\sqrt{d_k}} \right) V
\]
- \(Q\) = query
- \(K\) = key
- \(V\) = value
- \(d_k\) = dimension of the key vectors; dividing by \(\sqrt{d_k}\) keeps the dot products from growing so large that the softmax saturates
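A sketch of scaled dot-product attention for a single head (here Q, K, V are matrices with one row per token; this illustrative version omits masking and multiple heads):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (tokens, tokens)
    # Row-wise softmax, stabilized by subtracting each row's max.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # weighted sum of values
```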
13. Transformer Feedforward Layer
\[ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \]
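The same layer as a two-line sketch (`W1`, `b1`, `W2`, `b2` are illustrative parameter names following the equation):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward: ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```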
14. Training Objective (Put it all together!)
Find \(\boldsymbol{\theta}\) that minimizes the loss over the training data:
\[ \boldsymbol{\theta}^* = \underset{\boldsymbol{\theta}}{\arg\min} \; \mathcal{L}(f(\mathbf{x}; \boldsymbol{\theta}), \mathbf{y}) \]
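Putting the pieces together, a minimal sketch of the training loop this objective implies (a one-layer linear model with MSE and plain gradient descent, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)   # noisy targets

theta = np.zeros(3)
alpha = 0.1                                    # learning rate
for step in range(200):
    y_hat = X @ theta                          # forward pass
    grad = 2 * X.T @ (y_hat - y) / len(y)      # dL/dtheta for MSE
    theta -= alpha * grad                      # gradient descent update
print(theta)                                   # approaches true_w
```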
Forward-Looking Concepts
Concept | Equation | Purpose |
---|---|---|
Contrastive Loss (CLIP) | \(\mathcal{L} = -\log \dfrac{\exp(\text{sim}(\mathbf{x}, \mathbf{y})/\tau)}{\sum_{i=1}^{N} \exp(\text{sim}(\mathbf{x}, \mathbf{y}_i)/\tau)}\) | Align different modalities (text, image) |
Diffusion Models | \(\mathbf{x}_{t-1} = \dfrac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \dfrac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}\) | Generate high-quality images |
Reinforcement Learning | \(Q(s,a) = r + \gamma \max_{a'} Q(s', a')\) | Optimize sequential decision-making |
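As one concrete example from the table, tabular Q-learning nudges \(Q(s,a)\) toward the Bellman target with a learning rate (a minimal sketch; states and actions are illustrative integer indices into a Q-table):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```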