Optimizers
Four functions for optimizing neural network parameters.
Quick Reference
| Function | Description |
|---|---|
| optim_sgd_step() | Stochastic Gradient Descent |
| optim_sgd_momentum_step() | SGD with momentum |
| optim_adam_step() | Adam optimizer |
| tensor_clip_grad() | Gradient clipping |
optim_sgd_step()
optim_sgd_step(params: Tensor, grads: [float], lr: float) → Tensor
Performs a single optimization step using Stochastic Gradient Descent. Updates parameters in the direction opposite to the gradient.
Parameters
| Parameter | Type | Description |
|---|---|---|
| params | Tensor | Parameters to update |
| grads | [float] | Gradients (same size as params) |
| lr | float | Learning rate (typically 0.001-0.1) |
Returns
Updated parameters
Update Rule
θ_new = θ_old - lr * ∇θ
Examples
// Simple parameter update
let weights = tensor([1.0, 2.0, 3.0])
let gradients = [0.1, 0.2, 0.3]
let lr = 0.01
weights = optim_sgd_step(weights, gradients, lr)
// weights = [0.999, 1.998, 2.997]
// Training loop
let w = tensor_randn([10, 5])
let epochs = 100
let epoch = 0
while epoch < epochs {
let grads = compute_gradients(w)
w = optim_sgd_step(w, grads, 0.01)
epoch = epoch + 1
}
Learning Rate Guidelines
| Range | Use Case |
|---|---|
| 0.001-0.01 | Large networks, stable training |
| 0.01-0.1 | Small networks, quick convergence |
| 0.1-1.0 | Simple problems, experimentation |
Best practice: Start with lr=0.01 and adjust based on training behavior.
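One simple way to "adjust based on training behavior" is to decay the learning rate over time. The sketch below is illustrative only: the 0.99 decay factor is an assumption, not part of the optim_sgd_step() API, and compute_gradients() is the same placeholder used in the examples above.
// Sketch: exponential learning-rate decay (illustrative, not part of the API)
let w = tensor_randn([10, 5])
let lr = 0.01
let epoch = 0
while epoch < 100 {
    let grads = compute_gradients(w)
    w = optim_sgd_step(w, grads, lr)
    lr = lr * 0.99 // shrink the step size a little each epoch
    epoch = epoch + 1
}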
optim_sgd_momentum_step()
optim_sgd_momentum_step(params: Tensor, grads: [float], velocity: [float], lr: float, momentum: float) → (Tensor, [float])
SGD with momentum accumulates a velocity vector to accelerate convergence and dampen oscillations.
Parameters
| Parameter | Type | Description |
|---|---|---|
| params | Tensor | Parameters to update |
| grads | [float] | Current gradients |
| velocity | [float] | Accumulated velocity |
| lr | float | Learning rate |
| momentum | float | Momentum coefficient (typically 0.9) |
Returns
Tuple of (updated_params, updated_velocity)
Update Rules
v_new = momentum * v_old + ∇θ
θ_new = θ_old - lr * v_new
Example
// Initialize
let weights = tensor_randn([10, 5])
let velocity = tensor_zeros([50]).data // One entry per parameter (10 * 5 = 50), stored flat
let lr = 0.01
let momentum = 0.9
// Training loop
let epoch = 0
while epoch < 100 {
let grads = compute_gradients(weights)
// Momentum update
let result = optim_sgd_momentum_step(weights, grads, velocity, lr, momentum)
weights = result.0
velocity = result.1
epoch = epoch + 1
}
Advantage: Momentum helps escape shallow local minima and accelerates convergence along directions where successive gradients agree.
optim_adam_step()
optim_adam_step(params: Tensor, grads: [float], m: [float], v: [float], t: int, lr: float, beta1: float, beta2: float) → (Tensor, [float], [float])
Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates. Often the best default choice for neural networks.
Parameters
| Parameter | Type | Description |
|---|---|---|
| params | Tensor | Parameters to update |
| grads | [float] | Current gradients |
| m | [float] | First moment estimate (mean) |
| v | [float] | Second moment estimate (variance) |
| t | int | Timestep (starts at 1) |
| lr | float | Learning rate (default: 0.001) |
| beta1 | float | Exponential decay for first moment (default: 0.9) |
| beta2 | float | Exponential decay for second moment (default: 0.999) |
Returns
Tuple of (updated_params, updated_m, updated_v)
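Update Rules
The internal formulation is not documented here; assuming the standard Adam update, each step computes the following, where ε is a small stability constant not exposed by this signature:
m_new = beta1 * m_old + (1 - beta1) * ∇θ
v_new = beta2 * v_old + (1 - beta2) * (∇θ)^2
m_hat = m_new / (1 - beta1^t)
v_hat = v_new / (1 - beta2^t)
θ_new = θ_old - lr * m_hat / (√v_hat + ε)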
Example
// Initialize Adam state
let weights = tensor_randn([100])
let m = tensor_zeros([100]).data
let v = tensor_zeros([100]).data
let t = 0 // Incremented to 1 before the first update
// Hyperparameters (default values)
let lr = 0.001
let beta1 = 0.9
let beta2 = 0.999
// Training loop
let epoch = 0
while epoch < 1000 {
t = t + 1
let grads = compute_gradients(weights)
// Adam update
let result = optim_adam_step(weights, grads, m, v, t, lr, beta1, beta2)
weights = result.0
m = result.1
v = result.2
epoch = epoch + 1
}
Advantages
- Adaptive learning rates per parameter
- Works well with sparse gradients
- Typically requires less learning-rate tuning than plain SGD
- Generally faster convergence than SGD
Recommendation: Adam is often the best first choice for most neural network training tasks.
tensor_clip_grad()
tensor_clip_grad(grads: [float], max_norm: float) → [float]
Rescales gradients whose overall norm exceeds max_norm, preventing exploding gradients. Especially important when training recurrent or very deep networks.
Parameters
| Parameter | Type | Description |
|---|---|---|
| grads | [float] | Gradient values to clip |
| max_norm | float | Maximum allowed norm (typically 1.0-5.0) |
Returns
Clipped gradient values
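Clipping Rule
Assuming standard norm-based clipping, gradients within the limit pass through unchanged and larger ones are rescaled to the threshold:
g_new = g * (max_norm / ||g||)   if ||g|| > max_norm
g_new = g                        otherwise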
Example
// Compute gradients
let grads = compute_gradients(weights)
// Clip to prevent exploding gradients
grads = tensor_clip_grad(grads, 1.0)
// Now safe to apply update
weights = optim_sgd_step(weights, grads, 0.01)
Use case: Always use gradient clipping when training RNNs, LSTMs, or very deep networks.
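As a sketch of the combined pattern, the loop below clips gradients before each Adam update. This is illustrative only; compute_gradients() is the same placeholder used in the earlier examples, and the clip threshold of 1.0 is an assumption.
// Sketch: gradient clipping inside an Adam training loop (illustrative)
let weights = tensor_randn([100])
let m = tensor_zeros([100]).data
let v = tensor_zeros([100]).data
let t = 0
let epoch = 0
while epoch < 1000 {
    t = t + 1
    let grads = compute_gradients(weights)
    grads = tensor_clip_grad(grads, 1.0) // cap the gradient norm at 1.0
    let result = optim_adam_step(weights, grads, m, v, t, 0.001, 0.9, 0.999)
    weights = result.0
    m = result.1
    v = result.2
    epoch = epoch + 1
}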
Optimizer Comparison
| Optimizer | Speed | Memory | Best For |
|---|---|---|---|
| SGD | Fast | Low | Simple problems, well-tuned systems |
| SGD+Momentum | Fast | Medium | CNNs, image tasks |
| Adam | Medium | High | General purpose, default choice |