Backpropagation Functions
Four gradient-computation functions for automatic differentiation and backpropagation.
Overview
Backpropagation is the algorithm used to train neural networks by computing gradients of the loss function with respect to model parameters. Charl provides automatic differentiation (autograd) helper functions for common operations.
Key Concept: Gradients indicate how to adjust weights to reduce loss. These functions compute those gradients automatically.
Training Flow
1. Forward Pass: Compute predictions from inputs
2. Calculate Loss: Compare predictions to targets
3. Backward Pass: Compute gradients using the autograd functions
4. Update Weights: Adjust parameters using an optimizer (the update rule is shown below)
5. Repeat: Iterate until convergence
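Step 4 applies the standard gradient descent update. Assuming optim_sgd_step() implements plain stochastic gradient descent, as its name suggests, each parameter θ moves against its gradient, scaled by the learning rate η:

\theta_{t+1} = \theta_t - \eta \, \frac{\partial L}{\partial \theta}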
Quick Reference
| Function | Description |
|---|---|
| autograd_compute_linear_grad() | Gradients for linear layers |
| autograd_compute_relu_grad() | Gradients for ReLU activation |
| autograd_compute_sigmoid_grad() | Gradients for sigmoid activation |
| autograd_compute_mse_grad() | Gradients for MSE loss |
autograd_compute_linear_grad()
autograd_compute_linear_grad() → Tensor
Computes gradients for a linear (fully connected) layer. This function calculates how the loss changes with respect to the weights and biases of a linear transformation.
Mathematical Background
For a linear layer y = xW + b:
- Gradient w.r.t. weights: ∂L/∂W = xᵀ × ∂L/∂y
- Gradient w.r.t. bias: ∂L/∂b = ∂L/∂y
- Gradient w.r.t. input: ∂L/∂x = ∂L/∂y × Wᵀ
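As a concrete instance of these formulas (the numbers are purely illustrative), take a single example with x of shape 1×2 and an upstream gradient ∂L/∂y of shape 1×1:

x = \begin{bmatrix} 1 & 2 \end{bmatrix}, \quad \frac{\partial L}{\partial y} = \begin{bmatrix} 0.5 \end{bmatrix} \implies \frac{\partial L}{\partial W} = x^\top \frac{\partial L}{\partial y} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \begin{bmatrix} 0.5 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 1.0 \end{bmatrix}, \quad \frac{\partial L}{\partial b} = \begin{bmatrix} 0.5 \end{bmatrix}

Inputs with larger magnitude receive proportionally larger weight gradients, which is one reason input normalization (see Best Practices) helps training.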
Returns
| Type | Description |
|---|---|
| Tensor | Gradient tensor for weight updates |
Example: Training Linear Layer
// Initialize layer parameters
let input_size = 4
let output_size = 2
let weights = tensor_randn([input_size, output_size])
let bias = tensor_zeros([output_size])
// Training data
let x = tensor([1.0, 2.0, 3.0, 4.0])
let target = tensor([1.0, 0.0])
let learning_rate = 0.01
let epochs = 100
let epoch = 0
while epoch < epochs {
// Forward pass
let output = nn_linear(x, weights, bias)
let predictions = nn_sigmoid(output)
// Calculate loss
let loss = nn_mse_loss(predictions, target)
// Backward pass - compute gradients
let grad_weights = autograd_compute_linear_grad()
// Update weights (the bias update is omitted for brevity)
let weights = optim_sgd_step(weights, grad_weights, learning_rate)
if epoch % 20 == 0 {
print("Epoch " + str(epoch) + ", Loss: " + str(loss))
}
let epoch = epoch + 1
}
print("Training complete!")
autograd_compute_relu_grad()
autograd_compute_relu_grad() → Tensor
Computes gradients for the ReLU (Rectified Linear Unit) activation function. ReLU is defined as max(0, x).
Mathematical Background
ReLU function: f(x) = max(0, x)
Derivative:
- ∂f/∂x = 1 if x > 0
- ∂f/∂x = 0 if x ≤ 0
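The practical effect is that the ReLU gradient acts as a 0/1 mask on the gradient arriving from the layer above. The sketch below writes that mask out by hand for a fixed input so the gating is visible; during training the mask would come from autograd_compute_relu_grad(), and the tensor values here are purely illustrative.

// Input to ReLU and the gradient arriving from the layer above
let x = tensor([-1.0, 0.0, 2.0])
let upstream_grad = tensor([0.3, 0.3, 0.3])
// ReLU derivative at x: 1 where x > 0, else 0 (written out by hand here)
let relu_mask = tensor([0.0, 0.0, 1.0])
// Gradient w.r.t. x: the upstream gradient gated by the mask
let grad_x = tensor_multiply(upstream_grad, relu_mask)
print(str(tensor_get(grad_x, 0))) // 0.0 -- blocked: input was negative
print(str(tensor_get(grad_x, 2))) // 0.3 -- passes through unchanged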
Returns
| Type | Description |
|---|---|
| Tensor | Gradient tensor (1 for positive inputs, 0 otherwise) |
Example: Network with ReLU
// Two-layer network with ReLU
let w1 = tensor_randn([4, 8])
let b1 = tensor_zeros([8])
let w2 = tensor_randn([8, 1])
let b2 = tensor_zeros([1])
let x = tensor([1.0, 2.0, 3.0, 4.0])
let target = tensor([5.0])
let learning_rate = 0.01
let epoch = 0
while epoch < 50 {
// Forward pass
let h1 = nn_linear(x, w1, b1)
let a1 = nn_relu(h1) // ReLU activation
let output = nn_linear(a1, w2, b2)
// Loss
let loss = nn_mse_loss(output, target)
// Backward pass
// Gradient through second layer
let grad_w2 = autograd_compute_linear_grad()
// Gradient through ReLU
let grad_relu = autograd_compute_relu_grad()
// Gradient through first layer
let grad_w1 = autograd_compute_linear_grad()
// Update weights
let w2 = optim_sgd_step(w2, grad_w2, learning_rate)
let w1 = optim_sgd_step(w1, grad_w1, learning_rate)
if epoch % 10 == 0 {
print("Epoch " + str(epoch) + ", Loss: " + str(loss))
}
let epoch = epoch + 1
}
Dead ReLU Problem: ReLU neurons can "die" during training: once a neuron's input is always negative, it outputs 0, its gradient is 0, and it stops updating. Consider lowering the learning rate, using learning rate scheduling, or switching activation functions if this occurs.
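A rough diagnostic, sketched under the assumption that a1 is the 8-unit ReLU activation from the example above: count how many activations are exactly zero and track the count across many inputs (a unit is only truly dead if it outputs zero for every input).

// Count hidden units whose ReLU output is exactly 0 for this input
let dead = 0
let i = 0
while i < 8 {
    if tensor_get(a1, i) == 0.0 {
        let dead = dead + 1
    }
    let i = i + 1
}
print("Zero activations: " + str(dead) + " / 8")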
autograd_compute_sigmoid_grad()
autograd_compute_sigmoid_grad() → Tensor
Computes gradients for the sigmoid activation function. Sigmoid squashes values into the open interval (0, 1).
Mathematical Background
Sigmoid function: σ(x) = 1 / (1 + e⁻ˣ)
Derivative: ∂σ/∂x = σ(x) × (1 - σ(x))
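The derivative is largest at x = 0 and shrinks quickly as |x| grows, which is why stacking many sigmoid layers can produce vanishing gradients. A quick worked check:

\sigma(0) = 0.5 \implies \sigma'(0) = 0.5 \times (1 - 0.5) = 0.25 \quad \text{(the maximum)}

\sigma(4) \approx 0.982 \implies \sigma'(4) \approx 0.982 \times 0.018 \approx 0.018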
Returns
| Type | Description |
|---|---|
| Tensor | Gradient tensor based on sigmoid derivative |
Example: Binary Classification
// Binary classifier with sigmoid output
let weights = tensor_randn([4, 1])
let bias = tensor_zeros([1])
let x = tensor([0.5, 1.0, 0.8, 0.3])
let target = tensor([1.0]) // Binary label
let learning_rate = 0.1
let epoch = 0
while epoch < 100 {
// Forward pass
let logits = nn_linear(x, weights, bias)
let prediction = nn_sigmoid(logits)
// Loss
let loss = nn_cross_entropy_loss(prediction, target)
// Backward pass
let grad_sigmoid = autograd_compute_sigmoid_grad()
let grad_weights = autograd_compute_linear_grad()
// Update (the bias update is omitted in this simplified example)
let weights = optim_sgd_step(weights, grad_weights, learning_rate)
if epoch % 20 == 0 {
print("Epoch " + str(epoch) + ", Loss: " + str(loss))
print("Prediction: " + str(tensor_get(prediction, 0)))
}
let epoch = epoch + 1
}
// Final prediction
let final_logits = nn_linear(x, weights, bias)
let final_pred = nn_sigmoid(final_logits)
print("Final prediction: " + str(tensor_get(final_pred, 0)))
Use Case: Sigmoid is commonly used as the final activation for binary classification tasks. For multi-class problems, use softmax instead.
autograd_compute_mse_grad()
autograd_compute_mse_grad() → Tensor
Computes gradients for the Mean Squared Error (MSE) loss function. This is typically the starting point of backpropagation for regression tasks.
Mathematical Background
MSE Loss: L = (1/n) × Σ(yᵢ - ŷᵢ)²
Gradient: ∂L/∂ŷᵢ = (2/n) × (ŷᵢ - yᵢ)
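A worked instance with n = 2 predictions ŷ = (3, 5) against targets y = (1, 2):

L = \tfrac{1}{2}\left((1 - 3)^2 + (2 - 5)^2\right) = \tfrac{4 + 9}{2} = 6.5, \qquad \frac{\partial L}{\partial \hat{y}} = \tfrac{2}{2}(\hat{y} - y) = (2,\ 3)

Both components are positive because both predictions are too high; the SGD step moves them in the negative gradient direction, back toward the targets.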
Returns
| Type | Description |
|---|---|
| Tensor | Gradient of loss w.r.t. predictions |
Example: Simple Regression
// Simple linear regression: y = wx + b
let w = tensor([2.0]) // Weight
let b = tensor([1.0]) // Bias (starts at the true value and is held fixed in this simplified example)
// Training data
let x_train = tensor([1.0, 2.0, 3.0, 4.0, 5.0])
let y_train = tensor([3.0, 5.0, 7.0, 9.0, 11.0]) // True: y = 2x + 1
let learning_rate = 0.01
let epoch = 0
while epoch < 200 {
// Forward pass
let predictions = tensor_add(
tensor_multiply(x_train, w),
b
)
// Loss
let loss = nn_mse_loss(predictions, y_train)
// Backward pass
let grad_loss = autograd_compute_mse_grad()
// Compute the weight gradient (simplified; the full gradient sums these per-element products)
// In practice the chain rule is applied through all operations
let grad_w = tensor_multiply(grad_loss, x_train)
// Update
let w = optim_sgd_step(w, grad_w, learning_rate)
if epoch % 50 == 0 {
print("Epoch " + str(epoch))
print(" Loss: " + str(loss))
print(" Weight: " + str(tensor_get(w, 0)))
print(" Bias: " + str(tensor_get(b, 0)))
}
let epoch = epoch + 1
}
print("Training complete!")
print("Final w: " + str(tensor_get(w, 0)) + " (target: 2.0)")
print("Final b: " + str(tensor_get(b, 0)) + " (target: 1.0)")
Complete Training Example
Here's a complete example showing all gradient functions working together:
// Two-layer neural network for regression
fn train_network() {
// Network architecture: 4 -> 8 -> 1
let w1 = tensor_randn([4, 8])
let b1 = tensor_zeros([8])
let w2 = tensor_randn([8, 1])
let b2 = tensor_zeros([1])
// Training data
let x = tensor([1.0, 2.0, 3.0, 4.0])
let target = tensor([10.0])
let learning_rate = 0.01
let epochs = 500
let epoch = 0
while epoch < epochs {
// === FORWARD PASS ===
// Layer 1: Linear + ReLU
let h1 = nn_linear(x, w1, b1)
let a1 = nn_relu(h1)
// Layer 2: Linear + Sigmoid
let h2 = nn_linear(a1, w2, b2)
let output = nn_sigmoid(h2)
// Scale output for regression
let prediction = tensor_multiply(output, tensor([20.0]))
// Loss
let loss = nn_mse_loss(prediction, target)
// === BACKWARD PASS ===
// Gradient of loss
let grad_loss = autograd_compute_mse_grad()
// Gradient through sigmoid
let grad_sigmoid = autograd_compute_sigmoid_grad()
// Gradient through second linear layer
let grad_w2 = autograd_compute_linear_grad()
// Gradient through ReLU
let grad_relu = autograd_compute_relu_grad()
// Gradient through first linear layer
let grad_w1 = autograd_compute_linear_grad()
// === UPDATE WEIGHTS ===
// (Simplified: the activation gradients stand in for the bias gradients here)
let w2 = optim_sgd_step(w2, grad_w2, learning_rate)
let b2 = optim_sgd_step(b2, grad_sigmoid, learning_rate)
let w1 = optim_sgd_step(w1, grad_w1, learning_rate)
let b1 = optim_sgd_step(b1, grad_relu, learning_rate)
// Print progress
if epoch % 100 == 0 {
print("Epoch " + str(epoch) + ", Loss: " + str(loss))
print("Prediction: " + str(tensor_get(prediction, 0)))
}
let epoch = epoch + 1
}
print("Training complete!")
}
train_network()
Best Practices
DO:
- Compute gradients in reverse order of forward pass
- Store intermediate activations needed for backprop
- Check for NaN or exploding gradients during training
- Use appropriate learning rates (typically in the range 0.001 to 0.1)
- Normalize inputs to help gradient flow
- Monitor gradient magnitudes to diagnose training issues (a numerical gradient check is sketched after these lists)
DON'T:
- Forget to compute gradients for all trainable parameters
- Use learning rates that are too high (causes divergence)
- Ignore vanishing or exploding gradient problems
- Update weights before computing all gradients
- Mix up the order of gradient computations
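One way to diagnose a suspect gradient is a numerical finite-difference check. The sketch below assumes the scalar-weight regression setup from the MSE example (w, b, x_train, y_train) and additionally assumes nn_mse_loss() returns a plain number; if it returns a 1-element tensor, read the value out with tensor_get first.

// Approximate dL/dw with a central difference around the current w
let eps = 0.0001
let w_plus = tensor_add(w, tensor([eps]))
let w_minus = tensor_add(w, tensor([0.0 - eps]))
let loss_plus = nn_mse_loss(tensor_add(tensor_multiply(x_train, w_plus), b), y_train)
let loss_minus = nn_mse_loss(tensor_add(tensor_multiply(x_train, w_minus), b), y_train)
// Central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
let approx_grad = (loss_plus - loss_minus) / (2.0 * eps)
print("Numerical dL/dw: " + str(approx_grad))

A large mismatch between this estimate and the analytic gradient usually signals a bug in the backward pass.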
Common Issues
| Problem | Symptom | Solution |
|---|---|---|
| Vanishing Gradients | Very slow learning, no progress | Use ReLU, increase learning rate, normalize inputs |
| Exploding Gradients | Loss becomes NaN or infinity | Decrease learning rate, gradient clipping, better initialization |
| Dead Neurons | ReLU outputs always 0 | Lower learning rate, better initialization, use LeakyReLU |