Building a Neural Network

Complete example: Build and train a multi-layer neural network with automatic differentiation.

Overview

This tutorial demonstrates how to build and train a complete neural network in Charl v0.3.0 from scratch, with automatic differentiation handling the backward pass. The network and training setup are as follows:

Architecture

  • Input layer: 2 features
  • Hidden layer 1: 4 neurons + ReLU
  • Hidden layer 2: 2 neurons + ReLU
  • Output layer: 1 neuron + Sigmoid

Training

  • Loss: Mean Squared Error
  • Optimizer: SGD
  • Learning rate: 0.1
  • Automatic differentiation

Complete Program

This example trains a 3-layer network on the XOR problem using automatic differentiation.

// Multi-Layer Neural Network Training

// Training data
let X = tensor([
    0.0, 0.0,
    0.0, 1.0,
    1.0, 0.0,
    1.0, 1.0
], [4, 2])

let Y = tensor([0.0, 1.0, 1.0, 0.0], [4, 1])

print("=== 3-Layer Neural Network ===")
print("")
print("Architecture: 2 -> 4 -> 2 -> 1")
print("Dataset: XOR (4 samples)")
print("")

// Layer 1: 2 -> 4
let W1 = tensor_with_grad([
    0.5, -0.3, 0.2, 0.4,
    -0.1, 0.6, 0.3, -0.2
], [2, 4])
let b1 = tensor_with_grad([0.1, -0.1, 0.2, -0.2], [4])

// Layer 2: 4 -> 2
let W2 = tensor_with_grad([
    0.4, -0.3,
    0.2, -0.1,
    0.5, 0.2,
    -0.3, 0.4
], [4, 2])
let b2 = tensor_with_grad([0.0, 0.1], [2])

// Layer 3: 2 -> 1
let W3 = tensor_with_grad([0.5, -0.3], [2, 1])
let b3 = tensor_with_grad([0.0], [1])

// Create optimizer
let optimizer = sgd_create(0.1)

print("Training for 200 epochs...")
print("")

// Training loop
let epoch = 0
while epoch < 200 {
    // Forward pass
    // Layer 1
    let h1 = nn_linear(X, W1, b1)
    let a1 = nn_relu(h1)

    // Layer 2
    let h2 = nn_linear(a1, W2, b2)
    let a2 = nn_relu(h2)

    // Layer 3 (output)
    let h3 = nn_linear(a2, W3, b3)
    let pred = nn_sigmoid(h3)

    // Compute loss
    let loss = nn_mse_loss(pred, Y)

    // Backward pass - automatic differentiation
    tensor_backward(loss)

    // Update all parameters with optimizer
    let params = [W1, b1, W2, b2, W3, b3]
    let updated = sgd_step(optimizer, params)
    W1 = updated[0]
    b1 = updated[1]
    W2 = updated[2]
    b2 = updated[3]
    W3 = updated[4]
    b3 = updated[5]

    // Print progress every 50 epochs
    if epoch % 50 == 0 {
        print("Epoch " + str(epoch) + ": Loss = " + str(tensor_item(loss)))
    }

    epoch = epoch + 1
}

print("")
print("Training complete!")
print("")

// Test the trained network
print("=== Final Predictions ===")
print("")

// Test each input
print("Input: [0, 0]")
let test1 = tensor([0.0, 0.0], [1, 2])
let out1_h1 = nn_relu(nn_linear(test1, W1, b1))
let out1_h2 = nn_relu(nn_linear(out1_h1, W2, b2))
let out1 = nn_sigmoid(nn_linear(out1_h2, W3, b3))
print("  Prediction: " + str(tensor_item(out1)) + " (expected: 0.0)")

print("Input: [0, 1]")
let test2 = tensor([0.0, 1.0], [1, 2])
let out2_h1 = nn_relu(nn_linear(test2, W1, b1))
let out2_h2 = nn_relu(nn_linear(out2_h1, W2, b2))
let out2 = nn_sigmoid(nn_linear(out2_h2, W3, b3))
print("  Prediction: " + str(tensor_item(out2)) + " (expected: 1.0)")

print("Input: [1, 0]")
let test3 = tensor([1.0, 0.0], [1, 2])
let out3_h1 = nn_relu(nn_linear(test3, W1, b1))
let out3_h2 = nn_relu(nn_linear(out3_h1, W2, b2))
let out3 = nn_sigmoid(nn_linear(out3_h2, W3, b3))
print("  Prediction: " + str(tensor_item(out3)) + " (expected: 1.0)")

print("Input: [1, 1]")
let test4 = tensor([1.0, 1.0], [1, 2])
let out4_h1 = nn_relu(nn_linear(test4, W1, b1))
let out4_h2 = nn_relu(nn_linear(out4_h1, W2, b2))
let out4 = nn_sigmoid(nn_linear(out4_h2, W3, b3))
print("  Prediction: " + str(tensor_item(out4)) + " (expected: 0.0)")

Expected Output

=== 3-Layer Neural Network ===

Architecture: 2 -> 4 -> 2 -> 1
Dataset: XOR (4 samples)

Training for 200 epochs...

Epoch 0: Loss = 0.256
Epoch 50: Loss = 0.124
Epoch 100: Loss = 0.042
Epoch 150: Loss = 0.015

Training complete!

=== Final Predictions ===

Input: [0, 0]
  Prediction: 0.023 (expected: 0.0)
Input: [0, 1]
  Prediction: 0.981 (expected: 1.0)
Input: [1, 0]
  Prediction: 0.976 (expected: 1.0)
Input: [1, 1]
  Prediction: 0.019 (expected: 0.0)

Step-by-Step Explanation

Step 1: Initialize Parameters

// Use tensor_with_grad() to enable automatic differentiation
let W1 = tensor_with_grad([...], [2, 4])
let b1 = tensor_with_grad([...], [4])

All parameters that need gradients must be created with tensor_with_grad(). The inputs and targets (X and Y) use plain tensor() because they are never updated during training.

Step 2: Forward Pass

// Layer by layer computation
let h1 = nn_linear(X, W1, b1)
let a1 = nn_relu(h1)
let h2 = nn_linear(a1, W2, b2)
let a2 = nn_relu(h2)
let pred = nn_sigmoid(nn_linear(a2, W3, b3))

Each layer: linear transformation followed by activation
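
Based on the shapes used here, nn_linear(X, W, b) computes the affine map X * W + b, so the batched forward pass transforms shapes as [4, 2] -> [4, 4] -> [4, 2] -> [4, 1]; nn_relu and nn_sigmoid are applied element-wise and leave the shape unchanged.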

Step 3: Compute Loss

let loss = nn_mse_loss(pred, Y)

Mean Squared Error averages the squared differences between predictions and targets: mean((pred - Y)^2)
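
For instance (illustrative numbers only), predictions of [0.6, 0.4, 0.7, 0.3] against the XOR targets [0.0, 1.0, 1.0, 0.0] give a loss of (0.36 + 0.36 + 0.09 + 0.09) / 4 = 0.225.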

Step 4: Backward Pass

tensor_backward(loss)

Automatically computes gradients for ALL parameters (W1, b1, W2, b2, W3, b3)

Step 5: Update Parameters

let params = [W1, b1, W2, b2, W3, b3]
let updated = sgd_step(optimizer, params)
// Reassign updated tensors
W1 = updated[0]
b1 = updated[1]
// ... etc

The optimizer applies the computed gradients to each parameter and returns the updated tensors in the same order they were passed in

Key Concepts

Automatic Differentiation

Charl automatically builds a computation graph and computes gradients through the entire network with a single tensor_backward() call.

No manual gradient calculations needed!
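
As a minimal sketch of the mechanism, using only functions that appear in this tutorial (the shapes and values are illustrative):

// Minimal autodiff example: one layer, one backward call
let x = tensor([1.0, 2.0], [1, 2])                // input, no gradient needed
let y = tensor([1.0], [1, 1])                     // target
let w = tensor_with_grad([0.3, -0.2], [2, 1])     // trainable weight
let b = tensor_with_grad([0.0], [1])              // trainable bias

let pred = nn_sigmoid(nn_linear(x, w, b))         // forward pass records the graph
let loss = nn_mse_loss(pred, y)                   // scalar loss node
tensor_backward(loss)                             // gradients for w and b are now available

// A single optimizer step then consumes those gradients
let opt = sgd_create(0.1)
let updated = sgd_step(opt, [w, b])

The same pattern scales to the full 3-layer network above; the recorded graph simply gets deeper.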

Multi-Layer Networks

Stack multiple layers to learn complex non-linear patterns. Each layer transforms the representation produced by the layer before it.

More layers = more representational power (though deeper networks can be harder to train)

Activation Functions

ReLU and Sigmoid introduce non-linearity, allowing the network to learn complex decision boundaries.

Without activations, stacked linear layers collapse into a single linear layer
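
To see why, note that with the row-vector convention used above, stacking two linear layers gives (X * W1 + b1) * W2 + b2 = X * (W1 * W2) + (b1 * W2 + b2), which is just another single linear layer. A short sketch using the tensors defined in the program above:

// No activation: the composition of two linear layers is still one linear map
let z_linear = nn_linear(nn_linear(X, W1, b1), W2, b2)

// With ReLU in between, the collapse no longer holds and the network gains expressive power
let z_nonlinear = nn_linear(nn_relu(nn_linear(X, W1, b1)), W2, b2)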

Optimization

SGD optimizer iteratively adjusts parameters to minimize loss. Learning rate controls the step size.

Smaller learning rate = slower but more stable
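
For reference, a single SGD step applies the standard update rule to every parameter (this describes SGD in general, not a Charl-specific detail):

param = param - learning_rate * gradient

With the learning rate of 0.1 used here, a weight of 0.5 whose gradient is 0.4 would move to 0.5 - 0.1 * 0.4 = 0.46.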

Experiments to Try

1. Try Different Architectures

// Deeper network: 2 -> 8 -> 4 -> 2 -> 1
let W1 = tensor_with_grad([...], [2, 8])
let W2 = tensor_with_grad([...], [8, 4])
let W3 = tensor_with_grad([...], [4, 2])
let W4 = tensor_with_grad([...], [2, 1])
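
A deeper network also needs matching bias tensors (shapes [8], [4], [2], and [1] for this architecture), and the extra layer must be added to both the training loop and the test-time forward pass.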

2. Try Adam Optimizer

let optimizer = adam_create(0.01)
let updated = adam_step(optimizer, params)

Adam adapts a per-parameter step size and often converges faster than plain SGD

3. Try Tanh Activation

let a1 = nn_tanh(h1)  // Instead of nn_relu(h1)

Different activations change the learning dynamics; tanh, for example, outputs values in (-1, 1) instead of zeroing out negative inputs like ReLU

4. Vary Learning Rate

let optimizer = sgd_create(0.01)  // Slower
// or
let optimizer = sgd_create(0.5)   // Faster (may be unstable)

Next Steps