Optimizers

Four functions for optimizing neural network parameters during training.

Quick Reference

Function                     Description
optim_sgd_step()             Stochastic Gradient Descent
optim_sgd_momentum_step()    SGD with momentum
optim_adam_step()            Adam optimizer
tensor_clip_grad()           Gradient clipping

optim_sgd_step()

optim_sgd_step(params: Tensor, grads: [float], lr: float) → Tensor

Performs a single optimization step using Stochastic Gradient Descent. Updates parameters in the direction opposite to the gradient.

Parameters

Parameter   Type      Description
params      Tensor    Parameters to update
grads       [float]   Gradients (same size as params)
lr          float     Learning rate (typically 0.001-0.1)

Returns

Updated parameters

Update Rule

θ_new = θ_old - lr * ∇θ

Examples

// Simple parameter update
let weights = tensor([1.0, 2.0, 3.0])
let gradients = [0.1, 0.2, 0.3]
let lr = 0.01

weights = optim_sgd_step(weights, gradients, lr)
// weights = [0.999, 1.998, 2.997]

// Training loop
let w = tensor_randn([10, 5])
let epochs = 100
let epoch = 0

while epoch < epochs {
    let grads = compute_gradients(w)   // compute_gradients() is a user-supplied placeholder returning [float]
    w = optim_sgd_step(w, grads, 0.01)
    epoch = epoch + 1
}

Learning Rate Guidelines

Range        Use Case
0.001-0.01   Large networks, stable training
0.01-0.1     Small networks, quick convergence
0.1-1.0      Simple problems, experimentation

Best practice: Start with lr=0.01 and adjust based on training behavior.
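
One concrete way to "adjust based on training behavior" is to decay the learning rate as training progresses. The following is a minimal sketch, assuming only the constructs already used on this page; compute_gradients() is the same user-supplied placeholder as in the loop above:

// Exponential decay: shrink the learning rate a little every epoch
let w = tensor_randn([10, 5])
let lr = 0.01
let epoch = 0

while epoch < 100 {
    let grads = compute_gradients(w)
    w = optim_sgd_step(w, grads, lr)
    lr = lr * 0.98    // gradually lower the step size as training stabilizes
    epoch = epoch + 1
}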

optim_sgd_momentum_step()

optim_sgd_momentum_step(params: Tensor, grads: [float], velocity: [float], lr: float, momentum: float) → (Tensor, [float])

SGD with momentum accumulates a velocity vector to accelerate convergence and dampen oscillations.

Parameters

Parameter   Type      Description
params      Tensor    Parameters to update
grads       [float]   Current gradients
velocity    [float]   Accumulated velocity
lr          float     Learning rate
momentum    float     Momentum coefficient (typically 0.9)

Returns

Tuple of (updated_params, updated_velocity)

Update Rules

v_new = momentum * v_old + ∇θ

θ_new = θ_old - lr * v_new
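
A worked instance of these rules for a single parameter, using illustrative values θ_old = 1.0, v_old = 0.10, ∇θ = 0.20, lr = 0.01, momentum = 0.9:

v_new = 0.9 * 0.10 + 0.20 = 0.29
θ_new = 1.0 - 0.01 * 0.29 = 0.9971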

Example

// Initialize
let weights = tensor_randn([10, 5])
let velocity = tensor_zeros([50]).data  // 10 * 5 = 50 entries, flattened to match the params
let lr = 0.01
let momentum = 0.9

// Training loop
let epoch = 0
while epoch < 100 {
    let grads = compute_gradients(weights)

    // Momentum update
    let result = optim_sgd_momentum_step(weights, grads, velocity, lr, momentum)
    weights = result.0
    velocity = result.1

    epoch = epoch + 1
}

Advantage: Momentum dampens oscillations and accelerates progress along directions with consistent gradients, helping the optimizer move through plateaus and shallow local minima.

optim_adam_step()

optim_adam_step(params: Tensor, grads: [float], m: [float], v: [float], t: int, lr: float, beta1: float, beta2: float) → (Tensor, [float], [float])

Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates. Often the best default choice for neural networks.

Parameters

Parameter   Type      Description
params      Tensor    Parameters to update
grads       [float]   Current gradients
m           [float]   First moment estimate (mean)
v           [float]   Second moment estimate (variance)
t           int       Timestep (starts at 1)
lr          float     Learning rate (default: 0.001)
beta1       float     Exponential decay for first moment (default: 0.9)
beta2       float     Exponential decay for second moment (default: 0.999)

Returns

Tuple of (updated_params, updated_m, updated_v)
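
Update Rules

This reference does not spell out the internals of optim_adam_step(); assuming the standard Adam formulation with bias-corrected moment estimates, the update is:

m_new = beta1 * m_old + (1 - beta1) * ∇θ
v_new = beta2 * v_old + (1 - beta2) * (∇θ)²
m̂ = m_new / (1 - beta1^t)
v̂ = v_new / (1 - beta2^t)
θ_new = θ_old - lr * m̂ / (√v̂ + ε)

where ε is a small stability constant (commonly 1e-8). It does not appear in the signature, so it is presumably fixed internally.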

Example

// Initialize Adam state
let weights = tensor_randn([100])
let m = tensor_zeros([100]).data
let v = tensor_zeros([100]).data
let t = 0  // incremented to 1 before the first update, so t starts at 1 as required

// Hyperparameters (default values)
let lr = 0.001
let beta1 = 0.9
let beta2 = 0.999

// Training loop
let epoch = 0
while epoch < 1000 {
    t = t + 1

    let grads = compute_gradients(weights)

    // Adam update
    let result = optim_adam_step(weights, grads, m, v, t, lr, beta1, beta2)
    weights = result.0
    m = result.1
    v = result.2

    epoch = epoch + 1
}

Advantages

  • Adaptive learning rates per parameter
  • Works well with sparse gradients
  • Requires less hyperparameter tuning
  • Generally faster convergence than SGD

Recommendation: Adam is often the best first choice for most neural network training tasks.

tensor_clip_grad()

tensor_clip_grad(grads: [float], max_norm: float) → [float]

Clips gradient values to prevent exploding gradients. Essential for training recurrent networks and deep networks.

Parameters

Parameter   Type      Description
grads       [float]   Gradient values to clip
max_norm    float     Maximum allowed norm (typically 1.0-5.0)

Returns

Clipped gradient values
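
Clipping Rule

Assuming norm-based clipping, which is the usual reading of a max_norm parameter (the reference does not state this explicitly): gradients are rescaled only when their norm exceeds the threshold.

g_clipped = g * max_norm / ‖g‖    if ‖g‖ > max_norm
g_clipped = g                     otherwise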

Example

// Compute gradients
let grads = compute_gradients(weights)

// Clip to prevent exploding gradients
grads = tensor_clip_grad(grads, 1.0)

// Now safe to apply update
weights = optim_sgd_step(weights, grads, 0.01)

Use case: Always use gradient clipping when training RNNs, LSTMs, or very deep networks.
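
Putting it together, a minimal loop sketch that clips before every update; compute_gradients() is the same user-supplied placeholder used throughout this page:

// Clip, then step
let w = tensor_randn([20, 20])
let epoch = 0

while epoch < 100 {
    let grads = compute_gradients(w)
    grads = tensor_clip_grad(grads, 1.0)   // keep the gradient norm at or below 1.0
    w = optim_sgd_step(w, grads, 0.01)
    epoch = epoch + 1
}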

Optimizer Comparison

Optimizer      Speed    Memory   Best For
SGD            Fast     Low      Simple problems, well-tuned systems
SGD+Momentum   Fast     Medium   CNNs, image tasks
Adam           Medium   High     General purpose, default choice