Backpropagation Functions

Four gradient computation functions for automatic differentiation and backpropagation.

Overview

Backpropagation is the algorithm used to train neural networks by computing gradients of the loss function with respect to model parameters. Charl provides automatic differentiation (autograd) helper functions for common operations.

Key Concept: Gradients indicate the direction and size of the change needed in each weight to reduce the loss. These functions compute those gradients automatically.

Training Flow

  1. Forward Pass: Compute predictions from inputs
  2. Calculate Loss: Compare predictions to targets
  3. Backward Pass: Compute gradients using autograd functions
  4. Update Weights: Adjust parameters using optimizer
  5. Repeat: Iterate until convergence (a single iteration is sketched below)
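
The sketch below walks one iteration of this flow using only functions documented on this page; the shapes and values are illustrative, and a real loop would repeat steps 1-4 as in the full examples later in this section.

// One illustrative training iteration (steps 1-4)
let weights = tensor_randn([4, 2])
let bias = tensor_zeros([2])
let x = tensor([1.0, 2.0, 3.0, 4.0])
let target = tensor([1.0, 0.0])
let learning_rate = 0.01

// 1. Forward pass: compute predictions from inputs
let output = nn_linear(x, weights, bias)
let predictions = nn_sigmoid(output)

// 2. Calculate loss: compare predictions to targets
let loss = nn_mse_loss(predictions, target)

// 3. Backward pass: compute gradients
let grad_weights = autograd_compute_linear_grad()

// 4. Update weights: adjust parameters using the optimizer
let weights = optim_sgd_step(weights, grad_weights, learning_rate)

print("Loss after one step: " + str(loss))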

Quick Reference

Function Description
autograd_compute_linear_grad() Gradients for linear layers
autograd_compute_relu_grad() Gradients for ReLU activation
autograd_compute_sigmoid_grad() Gradients for sigmoid activation
autograd_compute_mse_grad() Gradients for MSE loss

autograd_compute_linear_grad()

autograd_compute_linear_grad() → Tensor

Computes gradients for a linear (fully connected) layer. This function calculates how the loss changes with respect to the weights and biases of a linear transformation.

Mathematical Background

For linear layer: y = xW + b

  • Gradient w.r.t. weights: ∂L/∂W = xᵀ × ∂L/∂y
  • Gradient w.r.t. bias: ∂L/∂b = ∂L/∂y
  • Gradient w.r.t. input: ∂L/∂x = ∂L/∂y × Wᵀ
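
As a quick shape check (illustrative shapes): if x is [1, 4], W is [4, 2], and b is [2], then y = xW + b is [1, 2]. The upstream gradient ∂L/∂y is also [1, 2], so xᵀ × ∂L/∂y is [4, 2] (matching W), and ∂L/∂y × Wᵀ is [1, 4] (matching x). The bias gradient ∂L/∂b = ∂L/∂y reduces to shape [2], matching b, once the batch dimension is summed out (trivial here, since the batch size is 1).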

Returns

Type Description
Tensor Gradient tensor for weight updates

Example: Training Linear Layer

// Initialize layer parameters
let input_size = 4
let output_size = 2

let weights = tensor_randn([input_size, output_size])
let bias = tensor_zeros([output_size])

// Training data
let x = tensor([1.0, 2.0, 3.0, 4.0])
let target = tensor([1.0, 0.0])

let learning_rate = 0.01
let epochs = 100
let epoch = 0

while epoch < epochs {
    // Forward pass
    let output = nn_linear(x, weights, bias)
    let predictions = nn_sigmoid(output)

    // Calculate loss
    let loss = nn_mse_loss(predictions, target)

    // Backward pass - compute gradients
    let grad_weights = autograd_compute_linear_grad()

    // Update weights
    let weights = optim_sgd_step(weights, grad_weights, learning_rate)

    if epoch % 20 == 0 {
        print("Epoch " + str(epoch) + ", Loss: " + str(loss))
    }

    let epoch = epoch + 1
}

print("Training complete!")

autograd_compute_relu_grad()

autograd_compute_relu_grad() → Tensor

Computes gradients for the ReLU (Rectified Linear Unit) activation function. ReLU is defined as max(0, x).

Mathematical Background

ReLU function: f(x) = max(0, x)

Derivative:

  • ∂f/∂x = 1 if x > 0
  • ∂f/∂x = 0 if x ≤ 0
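
For example, an input of [-2.0, 0.0, 3.0] produces the gradient mask [0, 0, 1]. During the backward pass the upstream gradient is multiplied element-wise by this mask, so gradient flows only through the positions where the input was positive.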

Returns

Type Description
Tensor Gradient tensor (1 where the input is positive, 0 elsewhere)

Example: Network with ReLU

// Two-layer network with ReLU
let w1 = tensor_randn([4, 8])
let b1 = tensor_zeros([8])
let w2 = tensor_randn([8, 1])
let b2 = tensor_zeros([1])

let x = tensor([1.0, 2.0, 3.0, 4.0])
let target = tensor([5.0])

let learning_rate = 0.01
let epoch = 0

while epoch < 50 {
    // Forward pass
    let h1 = nn_linear(x, w1, b1)
    let a1 = nn_relu(h1)  // ReLU activation
    let output = nn_linear(a1, w2, b2)

    // Loss
    let loss = nn_mse_loss(output, target)

    // Backward pass
    // Gradient through second layer
    let grad_w2 = autograd_compute_linear_grad()

    // Gradient through ReLU
    let grad_relu = autograd_compute_relu_grad()

    // Gradient through first layer
    let grad_w1 = autograd_compute_linear_grad()

    // Update weights
    let w2 = optim_sgd_step(w2, grad_w2, learning_rate)
    let w1 = optim_sgd_step(w1, grad_w1, learning_rate)

    if epoch % 10 == 0 {
        print("Epoch " + str(epoch) + ", Loss: " + str(loss))
    }

    let epoch = epoch + 1
}

Dead ReLU Problem: A ReLU neuron can "die" during training if its input stays negative, so it always outputs 0 and receives zero gradient. Consider lowering the learning rate, using learning rate scheduling, or switching to another activation function if this occurs.
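
A minimal sketch of one way to watch for this, placed inside the training loop above right after a1 is computed, assuming tensor_get and equality comparison work on the 8 hidden activation values. A unit that is zero for one input is not necessarily dead; it is the units that stay at zero across many inputs and epochs that have stopped learning.

// Count hidden units whose ReLU output is zero for the current input
let dead_count = 0
let i = 0
while i < 8 {
    if tensor_get(a1, i) == 0.0 {
        let dead_count = dead_count + 1
    }
    let i = i + 1
}
print("Zero activations: " + str(dead_count) + " / 8")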

autograd_compute_sigmoid_grad()

autograd_compute_sigmoid_grad() → Tensor

Computes gradients for the sigmoid activation function. Sigmoid squashes values into the open interval (0, 1).

Mathematical Background

Sigmoid function: σ(x) = 1 / (1 + e⁻ˣ)

Derivative: ∂σ/∂x = σ(x) × (1 - σ(x))
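
Since σ(0) = 0.5, the derivative peaks at 0.5 × (1 - 0.5) = 0.25 and approaches 0 for large positive or negative inputs. This is why deep stacks of sigmoid layers are prone to vanishing gradients (see Common Issues below).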

Returns

Type Description
Tensor Gradient tensor based on sigmoid derivative

Example: Binary Classification

// Binary classifier with sigmoid output
let weights = tensor_randn([4, 1])
let bias = tensor_zeros([1])

let x = tensor([0.5, 1.0, 0.8, 0.3])
let target = tensor([1.0])  // Binary label

let learning_rate = 0.1
let epoch = 0

while epoch < 100 {
    // Forward pass
    let logits = nn_linear(x, weights, bias)
    let prediction = nn_sigmoid(logits)

    // Loss
    let loss = nn_cross_entropy_loss(prediction, target)

    // Backward pass
    let grad_sigmoid = autograd_compute_sigmoid_grad()
    let grad_weights = autograd_compute_linear_grad()

    // Update
    let weights = optim_sgd_step(weights, grad_weights, learning_rate)

    if epoch % 20 == 0 {
        print("Epoch " + str(epoch) + ", Loss: " + str(loss))
        print("Prediction: " + str(tensor_get(prediction, 0)))
    }

    let epoch = epoch + 1
}

// Final prediction
let final_logits = nn_linear(x, weights, bias)
let final_pred = nn_sigmoid(final_logits)
print("Final prediction: " + str(tensor_get(final_pred, 0)))

Use Case: Sigmoid is commonly used as the final activation for binary classification tasks. For multi-class problems, use softmax instead.

autograd_compute_mse_grad()

autograd_compute_mse_grad() → Tensor

Computes gradients for the Mean Squared Error (MSE) loss function. This is typically the starting point of backpropagation for regression tasks.

Mathematical Background

MSE Loss: L = (1/n) × Σ(yᵢ - ŷᵢ)²

Gradient: ∂L/∂ŷᵢ = (2/n) × (ŷᵢ - yᵢ)
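
Worked example: with predictions ŷ = [3.0, 5.0], targets y = [2.0, 6.0], and n = 2, the gradient is (2/2) × (ŷ - y) = [1.0, -1.0]. The first prediction is too high and is pushed down; the second is too low and is pushed up.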

Returns

Type Description
Tensor Gradient of loss w.r.t. predictions

Example: Simple Regression

// Simple linear regression: y = wx + b
let w = tensor([0.5])  // Weight (initial guess; the true value is 2.0)
let b = tensor([1.0])  // Bias (kept fixed in this example)

// Training data
let x_train = tensor([1.0, 2.0, 3.0, 4.0, 5.0])
let y_train = tensor([3.0, 5.0, 7.0, 9.0, 11.0])  // True: y = 2x + 1

let learning_rate = 0.01
let epoch = 0

while epoch < 200 {
    // Forward pass
    let predictions = tensor_add(
        tensor_multiply(x_train, w),
        b
    )

    // Loss
    let loss = nn_mse_loss(predictions, y_train)

    // Backward pass
    let grad_loss = autograd_compute_mse_grad()

    // Compute the weight gradient (simplified): the full chain rule sums the
    // per-example products xᵢ × ∂L/∂ŷᵢ; here the element-wise product is used
    // for illustration
    let grad_w = tensor_multiply(grad_loss, x_train)

    // Update
    let w = optim_sgd_step(w, grad_w, learning_rate)

    if epoch % 50 == 0 {
        print("Epoch " + str(epoch))
        print("  Loss: " + str(loss))
        print("  Weight: " + str(tensor_get(w, 0)))
        print("  Bias: " + str(tensor_get(b, 0)))
    }

    let epoch = epoch + 1
}

print("Training complete!")
print("Final w: " + str(tensor_get(w, 0)) + " (target: 2.0)")
print("Final b: " + str(tensor_get(b, 0)) + " (target: 1.0)")

Complete Training Example

Here's a complete example showing all gradient functions working together:

// Two-layer neural network for regression
fn train_network() {
    // Network architecture: 4 -> 8 -> 1
    let w1 = tensor_randn([4, 8])
    let b1 = tensor_zeros([8])
    let w2 = tensor_randn([8, 1])
    let b2 = tensor_zeros([1])

    // Training data
    let x = tensor([1.0, 2.0, 3.0, 4.0])
    let target = tensor([10.0])

    let learning_rate = 0.01
    let epochs = 500
    let epoch = 0

    while epoch < epochs {
        // === FORWARD PASS ===
        // Layer 1: Linear + ReLU
        let h1 = nn_linear(x, w1, b1)
        let a1 = nn_relu(h1)

        // Layer 2: Linear + Sigmoid
        let h2 = nn_linear(a1, w2, b2)
        let output = nn_sigmoid(h2)

        // Scale output for regression
        let prediction = tensor_multiply(output, tensor([20.0]))

        // Loss
        let loss = nn_mse_loss(prediction, target)

        // === BACKWARD PASS ===
        // Gradient of loss
        let grad_loss = autograd_compute_mse_grad()

        // Gradient through sigmoid
        let grad_sigmoid = autograd_compute_sigmoid_grad()

        // Gradient through second linear layer
        let grad_w2 = autograd_compute_linear_grad()

        // Gradient through ReLU
        let grad_relu = autograd_compute_relu_grad()

        // Gradient through first linear layer
        let grad_w1 = autograd_compute_linear_grad()

        // === UPDATE WEIGHTS ===
        // (simplified: the activation gradients stand in for the bias gradients here)
        let w2 = optim_sgd_step(w2, grad_w2, learning_rate)
        let b2 = optim_sgd_step(b2, grad_sigmoid, learning_rate)
        let w1 = optim_sgd_step(w1, grad_w1, learning_rate)
        let b1 = optim_sgd_step(b1, grad_relu, learning_rate)

        // Print progress
        if epoch % 100 == 0 {
            print("Epoch " + str(epoch) + ", Loss: " + str(loss))
            print("Prediction: " + str(tensor_get(prediction, 0)))
        }

        let epoch = epoch + 1
    }

    print("Training complete!")
}

train_network()
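
Note that in this simplified API the autograd functions take no arguments; the examples assume the runtime tracks the tensors involved in the most recent forward operations internally. A general-purpose autograd engine would record the full computation graph and apply the chain rule node by node, which is the ordering the backward-pass calls above follow.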

Best Practices

DO:

  • Compute gradients in reverse order of forward pass
  • Store intermediate activations needed for backprop
  • Check for NaN or exploding gradients during training
  • Use appropriate learning rates (typically 0.001 - 0.1)
  • Normalize inputs to help gradient flow
  • Monitor gradient magnitudes to diagnose training issues (see the sketch below)
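
A minimal sketch of such a check, assuming it is placed inside the regression loop above right after grad_loss is computed, that tensor_get works on its 5 elements, and that re-binding with let inside a block updates the value, as with the loop counters in the examples:

// Track the largest absolute entry of the 1-D loss gradient (5 elements here)
let max_abs = 0.0
let i = 0
while i < 5 {
    let g = tensor_get(grad_loss, i)
    if g < 0.0 {
        let g = 0.0 - g
    }
    if g > max_abs {
        let max_abs = g
    }
    let i = i + 1
}
// A value that keeps growing suggests exploding gradients;
// one stuck near zero suggests vanishing gradients
print("Max |grad|: " + str(max_abs))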

DON'T:

  • Forget to compute gradients for all trainable parameters
  • Use learning rates that are too high (causes divergence)
  • Ignore vanishing or exploding gradient problems
  • Update weights before computing all gradients
  • Mix up the order of gradient computations

Common Issues

  • Vanishing Gradients. Symptom: very slow learning, no progress. Solution: use ReLU, increase the learning rate, normalize inputs.
  • Exploding Gradients. Symptom: loss becomes NaN or infinity. Solution: decrease the learning rate, clip gradients, use better initialization.
  • Dead Neurons. Symptom: ReLU outputs are always 0. Solution: lower the learning rate, use better initialization, or switch to LeakyReLU.

Related Topics