Optimizers
Four functions for optimizing neural network parameters.
Quick Reference
| Function | Description |
|---|---|
| optim_sgd_step() | Stochastic Gradient Descent |
| optim_sgd_momentum_step() | SGD with momentum |
| optim_adam_step() | Adam optimizer |
| tensor_clip_grad() | Gradient clipping |
optim_sgd_step()
optim_sgd_step(params: Tensor, grads: [float], lr: float) → Tensor
Performs a single optimization step using Stochastic Gradient Descent. Updates parameters in the direction opposite to the gradient.
Parameters
| Parameter | Type | Description |
|---|---|---|
| params | Tensor | Parameters to update |
| grads | [float] | Gradients (same size as params) |
| lr | float | Learning rate (typically 0.001-0.1) |
Returns
Updated parameters
Update Rule
θ_new = θ_old - lr * ∇θ
Examples
// Simple parameter update
let weights = tensor([1.0, 2.0, 3.0])
let gradients = [0.1, 0.2, 0.3]
let lr = 0.01
weights = optim_sgd_step(weights, gradients, lr)
// weights = [0.999, 1.998, 2.997]
// Training loop
let w = tensor_randn([10, 5])
let epochs = 100
let epoch = 0
while epoch < epochs {
let grads = compute_gradients(w)
w = optim_sgd_step(w, grads, 0.01)
epoch = epoch + 1
}
Learning Rate Guidelines
| Range | Use Case |
|---|---|
| 0.001-0.01 | Large networks, stable training |
| 0.01-0.1 | Small networks, quick convergence |
| 0.1-1.0 | Simple problems, experimentation |
Best practice: Start with lr=0.01 and adjust based on training behavior.
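One simple way to "adjust based on training behavior" is to decay the learning rate over time. The sketch below is illustrative only: the 0.99 decay factor is an assumption, not part of the optim_sgd_step() API, and compute_gradients() is the same placeholder used in the examples above.
// Sketch: exponential learning-rate decay (illustrative, not part of the API)
let w = tensor_randn([10, 5])
let lr = 0.01
let epoch = 0
while epoch < 100 {
    let grads = compute_gradients(w)
    w = optim_sgd_step(w, grads, lr)
    lr = lr * 0.99 // shrink the step size a little each epoch
    epoch = epoch + 1
}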
optim_sgd_momentum_step()
optim_sgd_momentum_step(params: Tensor, grads: [float], velocity: [float], lr: float, momentum: float) → (Tensor, [float])
SGD with momentum accumulates a velocity vector to accelerate convergence and dampen oscillations.
Parameters
| Parameter | Type | Description |
|---|---|---|
| params | Tensor | Parameters to update |
| grads | [float] | Current gradients |
| velocity | [float] | Accumulated velocity |
| lr | float | Learning rate |
| momentum | float | Momentum coefficient (typically 0.9) |
Returns
Tuple of (updated_params, updated_velocity)
Update Rules
v_new = momentum * v_old + ∇θ
θ_new = θ_old - lr * v_new
Example
// Initialize
let weights = tensor_randn([10, 5])
let velocity = tensor_zeros([50]).data // One entry per parameter (10 * 5 = 50), stored flat
let lr = 0.01
let momentum = 0.9
// Training loop
let epoch = 0
while epoch < 100 {
let grads = compute_gradients(weights)
// Momentum update
let result = optim_sgd_momentum_step(weights, grads, velocity, lr, momentum)
weights = result.0
velocity = result.1
epoch = epoch + 1
}
Advantage: Momentum helps escape shallow local minima and accelerates convergence along directions where successive gradients agree.
optim_adam_step()
optim_adam_step(params: Tensor, grads: [float], m: [float], v: [float], t: int, lr: float, beta1: float, beta2: float) → (Tensor, [float], [float])
Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates. Often the best default choice for neural networks.
Parameters
| Parameter | Type | Description |
|---|---|---|
| params | Tensor | Parameters to update |
| grads | [float] | Current gradients |
| m | [float] | First moment estimate (mean) |
| v | [float] | Second moment estimate (variance) |
| t | int | Timestep (starts at 1) |
| lr | float | Learning rate (default: 0.001) |
| beta1 | float | Exponential decay for first moment (default: 0.9) |
| beta2 | float | Exponential decay for second moment (default: 0.999) |
Returns
Tuple of (updated_params, updated_m, updated_v)
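Update Rules
The internal formulation is not documented here; assuming the standard Adam update, each step computes the following, where ε is a small stability constant not exposed by this signature:
m_new = beta1 * m_old + (1 - beta1) * ∇θ
v_new = beta2 * v_old + (1 - beta2) * (∇θ)^2
m_hat = m_new / (1 - beta1^t)
v_hat = v_new / (1 - beta2^t)
θ_new = θ_old - lr * m_hat / (√v_hat + ε)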
Example
// Initialize Adam state
let weights = tensor_randn([100])
let m = tensor_zeros([100]).data
let v = tensor_zeros([100]).data
let t = 0 // Incremented to 1 before the first update
// Hyperparameters (default values)
let lr = 0.001
let beta1 = 0.9
let beta2 = 0.999
// Training loop
let epoch = 0
while epoch < 1000 {
t = t + 1
let grads = compute_gradients(weights)
// Adam update
let result = optim_adam_step(weights, grads, m, v, t, lr, beta1, beta2)
weights = result.0
m = result.1
v = result.2
epoch = epoch + 1
}
Advantages
- Adaptive learning rates per parameter
- Works well with sparse gradients
- Typically requires less learning-rate tuning than plain SGD
- Generally faster convergence than SGD
Recommendation: Adam is often the best first choice for most neural network training tasks.
tensor_clip_grad()
tensor_clip_grad(grads: [float], max_norm: float) → [float]
Rescales gradients whose overall norm exceeds max_norm, preventing exploding gradients. Especially important when training recurrent or very deep networks.
Parameters
| Parameter | Type | Description |
|---|---|---|
| grads | [float] | Gradient values to clip |
| max_norm | float | Maximum allowed norm (typically 1.0-5.0) |
Returns
Clipped gradient values
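Clipping Rule
Assuming standard norm-based clipping, gradients within the limit pass through unchanged and larger ones are rescaled to the threshold:
g_new = g * (max_norm / ||g||)   if ||g|| > max_norm
g_new = g                        otherwise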
Example
// Compute gradients
let grads = compute_gradients(weights)
// Clip to prevent exploding gradients
grads = tensor_clip_grad(grads, 1.0)
// Now safe to apply update
weights = optim_sgd_step(weights, grads, 0.01)
Use case: Always use gradient clipping when training RNNs, LSTMs, or very deep networks.
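As a sketch of the combined pattern, the loop below clips gradients before each Adam update. This is illustrative only; compute_gradients() is the same placeholder used in the earlier examples, and the clip threshold of 1.0 is an assumption.
// Sketch: gradient clipping inside an Adam training loop (illustrative)
let weights = tensor_randn([100])
let m = tensor_zeros([100]).data
let v = tensor_zeros([100]).data
let t = 0
let epoch = 0
while epoch < 1000 {
    t = t + 1
    let grads = compute_gradients(weights)
    grads = tensor_clip_grad(grads, 1.0) // cap the gradient norm at 1.0
    let result = optim_adam_step(weights, grads, m, v, t, 0.001, 0.9, 0.999)
    weights = result.0
    m = result.1
    v = result.2
    epoch = epoch + 1
}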
Optimizer Comparison
| Optimizer | Speed | Memory | Best For |
|---|---|---|---|
| SGD | Fast | Low | Simple problems, well-tuned systems |
| SGD+Momentum | Fast | Medium | CNNs, image tasks |
| Adam | Medium | High | General purpose, default choice |