Introduction to Deep Learning

From Linear Regression to Neural Networks


ML for Science and Engineering — Lecture 11
Joseph Bakarji

Why Does Deep Learning Work?

The success of deep learning reveals two underappreciated principles about intelligence:

  • Self-organization: intelligence emerges from simple elements that learn to coordinate
  • Data-driven adaptation: enabling a system to learn from its environment is more powerful than encoding rules
Core Insight
Deep networks are collections of simple elements that collaborate to process inputs and predict outputs. Their power comes not from any single element, but from their collective adaptation to data.

Rules vs. the Ability to Learn

Rule-Based Approach

  • Encode expert knowledge as explicit rules
  • Define what to do in each situation
  • Brittle when the environment changes
  • Cannot adapt to unseen contexts

Learning-Based Approach

  • Give the system the ability to learn
  • Let it adapt to its environment autonomously
  • Flexible in complex, unpredictable settings
  • More powerful when you cannot predict every case
Do you teach your children rigid rules for every situation, or do you give them the ability to adapt to any context they might encounter?

The Intellectual Milieu of the 1940s

An extraordinary convergence of ideas set the stage for artificial intelligence:

  • Russell & Whitehead argued in Principia Mathematica (1910-1913) that mathematics can be derived from formal logic. If logic is universal, and machines can perform logic...
  • Von Neumann designed the stored-program computer (EDVAC, 1945), using McCulloch-Pitts neurons as the conceptual model.
  • Norbert Wiener founded cybernetics (1948): feedback, control, and communication in animals and machines.
The reasoning:

Logic can express any computation.
Neurons perform logical operations.
Therefore, networks of neurons can compute anything.

This launched AI as a field.

Walter Pitts and the Dream of Logical Intelligence

Walter Pitts was a self-taught, homeless teenager from Detroit who, at age 12, read Principia Mathematica and wrote to Bertrand Russell pointing out errors. Russell invited Pitts to study with him.

Pitts met Warren McCulloch, a neurophysiologist asking: how does the brain compute? They combined propositional logic with neural anatomy to create the first formal neuron model.

Their 1943 paper showed that networks of binary threshold units could implement any logical function, founding both neural networks and digital computing theory.

The Macy Conferences (1946-1953)
McCulloch chaired interdisciplinary meetings with von Neumann, Wiener, Pitts, Bateson, and Margaret Mead. Here cybernetics crystallized, and the ideas that would become AI, information theory, and cognitive science were debated.

Macy Conferences · The Story of Walter Pitts (Nautilus)

The McCulloch-Pitts Neuron (1943)

A binary threshold unit:

\( y = \begin{cases} 1 & \text{if } \sum_i w_i x_i \geq \theta \\ 0 & \text{otherwise} \end{cases} \)

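The unit above can be sketched in a few lines. The weights and thresholds below are illustrative choices (not from the lecture) that realize basic logic gates, which the 1943 paper showed is always possible:

```python
def mp_neuron(x, w, theta):
    """McCulloch-Pitts unit: fire (1) iff the weighted input sum reaches theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# Logic gates as particular weight/threshold choices
AND = lambda a, b: mp_neuron([a, b], [1, 1], theta=2)
OR  = lambda a, b: mp_neuron([a, b], [1, 1], theta=1)
NOT = lambda a: mp_neuron([a], [-1], theta=0)
```

Composing such gates yields any Boolean function, which is the sense in which networks of these units "compute anything."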

From Theory to Biology: The Squid Giant Axon

While McCulloch-Pitts modeled neurons as logical abstractions, experimentalists studied the real thing. The squid giant axon (1000x thicker than human axons) was large enough to insert electrodes inside.

1939 Cole & Curtis develop voltage clamp: hold membrane potential fixed, measure current.
1952 Hodgkin & Huxley fit ODEs to voltage clamp data with gating variables \(m, h, n\) governed by first-order kinetics.
1963 Nobel Prize for Hodgkin, Huxley, and Eccles.
A triumph of scientific modeling: Hodgkin and Huxley had no computers. They solved the ODEs by hand with a desk calculator, predicting the action potential shape before it could be measured at full resolution.

Historical Perspective (PMC)

The Hodgkin-Huxley Model

\( C_m \frac{dV}{dt} = I_{\text{ext}} - g_{Na} m^3 h (V - E_{Na}) - g_K n^4 (V - E_K) - g_L (V - E_L) \)

The gating variables \(m, h, n \in [0,1]\) model ion channel opening probabilities. Na\(^+\) channels need 3 activation gates (\(m^3\)) and 1 inactivation gate (\(h\)); K\(^+\) channels need 4 gates (\(n^4\)). Each satisfies:

\( \frac{dx}{dt} = \alpha_x(V)(1-x) - \beta_x(V)x \)

where \(\alpha_x, \beta_x\) are voltage-dependent rate functions that H&H fit to experimental data as combinations of exponentials and rationals, e.g.:

\( \alpha_m(V) = \frac{0.1(V+40)}{1 - e^{-(V+40)/10}} \)

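The model above can be integrated with forward Euler. The sketch below uses the standard Hodgkin-Huxley parameter values and rate functions; the time step, duration, and injected current are illustrative choices, not values from the lecture:

```python
import numpy as np

# Standard HH parameters (units: mV, ms, mS/cm^2, uF/cm^2)
C_m, g_Na, g_K, g_L = 1.0, 120.0, 36.0, 0.3
E_Na, E_K, E_L = 50.0, -77.0, -54.387

# Voltage-dependent rate functions fit by H&H (modern sign convention)
a_m = lambda V: 0.1 * (V + 40) / (1 - np.exp(-(V + 40) / 10))
b_m = lambda V: 4.0 * np.exp(-(V + 65) / 18)
a_h = lambda V: 0.07 * np.exp(-(V + 65) / 20)
b_h = lambda V: 1 / (1 + np.exp(-(V + 35) / 10))
a_n = lambda V: 0.01 * (V + 55) / (1 - np.exp(-(V + 55) / 10))
b_n = lambda V: 0.125 * np.exp(-(V + 65) / 80)

def simulate(I_ext=10.0, T=50.0, dt=0.01):
    V = -65.0
    # Start gating variables at their resting steady states x_inf = a / (a + b)
    m = a_m(V) / (a_m(V) + b_m(V))
    h = a_h(V) / (a_h(V) + b_h(V))
    n = a_n(V) / (a_n(V) + b_n(V))
    Vs = []
    for _ in range(int(T / dt)):
        I_ion = (g_Na * m**3 * h * (V - E_Na)
                 + g_K * n**4 * (V - E_K)
                 + g_L * (V - E_L))
        V += dt * (I_ext - I_ion) / C_m              # membrane equation
        m += dt * (a_m(V) * (1 - m) - b_m(V) * m)    # first-order gating kinetics
        h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
        n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
        Vs.append(V)
    return np.array(Vs)

V = simulate()
```

With this level of injected current the membrane fires repeated action potentials, with the voltage swinging from rest near -65 mV to a peak above 0 mV.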

Two Views of the Neuron

              McCulloch-Pitts (1943)                   Hodgkin-Huxley (1952)
Nature        Computational / Logical                  Dynamical / Physical
Output        Binary: 0 or 1                           Continuous voltage \(V(t)\)
Time          Instantaneous                            Evolves via coupled ODEs
Parameters    Weights \(w_i\), threshold \(\theta\)    Conductances, Nernst potentials, gating kinetics
Discovery     Theoretical construction from logic      Fit from voltage clamp experiments
Legacy        Neural networks, deep learning, AI       Computational neuroscience, biophysics
Modern deep learning descends from the McCulloch-Pitts tradition: simplified, computational units whose power comes from collective self-organization, not biological fidelity. The architecture matters less than the ability to adapt.

From Linear Models
to Deep Networks

Building the mathematical foundations

The Linear Hypothesis

The simplest supervised learning model assumes a linear relationship between inputs and outputs:

\( \hat{y} = f_\mathbf{w}(\mathbf{x}) = \mathbf{w} \cdot \boldsymbol{\phi}(\mathbf{x}) = \sum_{j=1}^p w_j \phi_j(\mathbf{x}) \)

The feature vector \(\boldsymbol{\phi}(\mathbf{x})\) encodes our assumptions about the data:

\(\boldsymbol{\phi}(\mathbf{x}) = [1, x]\)   (linear)

\(\boldsymbol{\phi}(\mathbf{x}) = [1, x, x^2, x^3]\)   (polynomial)

\(\boldsymbol{\phi}(\mathbf{x}) = [1, x, \sin(3x)]\)   (custom)

\(\boldsymbol{\phi}(\mathbf{x}) = \;?\;?\;?\)   (what if we could learn this?)

Recall: We split data into training, validation, and test sets to select features and evaluate generalization. The question remains: how do we choose \(\boldsymbol{\phi}\)?

Neural networks answer this question by learning the features directly from data.
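Before features are learned, the hand-crafted case is an ordinary least-squares problem. A minimal sketch with the polynomial feature map above, on a hypothetical toy dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 0.5 * x**3 - x + 0.1 * rng.standard_normal(50)   # hypothetical noisy samples

# Hand-crafted feature map: phi(x) = [1, x, x^2, x^3]
Phi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

# Linear least squares: w_hat = argmin_w ||Phi w - y||^2
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w_hat
```

The model is linear in the weights even though phi is nonlinear in x; everything that follows replaces the fixed Phi with a learned one.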

Linear Predictor: Network View

For multiple inputs and outputs:

\( \hat{\mathbf{y}} = W\mathbf{x} + \mathbf{b} \)

In index notation:

\( \hat{y}_i = \sum_{j=1}^n W_{ij} x_j + b_i \)

where \(W \in \mathbb{R}^{m \times n}\), \(\mathbf{b} \in \mathbb{R}^m\).

Each output is a weighted sum of all inputs. The network diagram makes this structure visible:

Activation Functions

To introduce nonlinearity, we pass the linear output through an activation function \(f\):

\( \hat{\mathbf{y}} = f(W\mathbf{x} + \mathbf{b}) \)
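The lecture leaves f generic here; ReLU and sigmoid appear later in the backpropagation example and the universal-approximation demo. A minimal sketch of these common choices:

```python
import numpy as np

def relu(z):
    """max(0, z), elementwise."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes z into (-1, 1)."""
    return np.tanh(z)
```

Without such a nonlinearity, stacking linear layers collapses to a single linear map, since \(W_2(W_1\mathbf{x}) = (W_2 W_1)\mathbf{x}\).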

The Neural Network

A single hidden layer:

\( \hat{\mathbf{y}} = f\bigl(W_2 \;f(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2\bigr) \)
  • \(\mathbf{v} = f(W_1 \mathbf{x} + \mathbf{b}_1)\) — hidden activations
  • \(\hat{\mathbf{y}} = f(W_2 \mathbf{v} + \mathbf{b}_2)\) — output
  • The hidden layer learns a feature representation \(\mathbf{v} = \phi(\mathbf{x})\)
Instead of hand-crafting \(\phi(\mathbf{x})\), the network learns it from data.

Going Deep

Stack multiple hidden layers to build hierarchical representations:

\( \hat{\mathbf{y}} = f\bigl(W_L \;f(\cdots f(W_2 \;f(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)\cdots) + \mathbf{b}_L\bigr) \)

\(\mathbf{v}^{(1)} = f(W_1\mathbf{x} + \mathbf{b}_1)\)

\(\mathbf{v}^{(2)} = f(W_2\mathbf{v}^{(1)} + \mathbf{b}_2)\)

\(\mathbf{v}^{(3)} = f(W_3\mathbf{v}^{(2)} + \mathbf{b}_3)\)

\(\hat{\mathbf{y}} = f(W_4\mathbf{v}^{(3)} + \mathbf{b}_4)\)
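The layer-by-layer composition above is just a loop. A minimal forward-pass sketch with hypothetical layer widths and random weights (initialization scale chosen arbitrarily for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases, f=relu):
    """Apply v <- f(W v + b) for each layer in turn; return the final output."""
    v = x
    for W, b in zip(weights, biases):
        v = f(W @ v + b)
    return v

rng = np.random.default_rng(0)
sizes = [2, 5, 4, 3, 1]   # hypothetical widths: x -> v1 -> v2 -> v3 -> y_hat
weights = [0.5 * rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]

y_hat = forward(np.array([1.0, -2.0]), weights, biases)
```

Each intermediate v is one of the hidden representations \(\mathbf{v}^{(l)}\); depth is just the length of the list.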

Why Depth? Hierarchical Feature Learning

Deep networks learn features at increasing levels of abstraction, from simple edges to complex concepts.

Layer 1

Edges, gradients,
simple patterns

Layer 2-3

Textures, parts,
local combinations

Deeper Layers

Objects, faces,
semantic concepts

Reference: Lee et al., "Unsupervised learning of hierarchical representations with convolutional deep belief networks," Communications of the ACM (2011).

Training Neural Networks

Loss functions, gradients, and backpropagation

Loss Functions

Regression (MSE)

\( \mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n \|\hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)}\|^2 \)

Squared distance between predictions and targets. Differentiable everywhere, standard for continuous outputs.

Classification (Binary Cross-Entropy)

\( \mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^n \bigl[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\bigr] \)

Measures how well predicted probabilities match binary labels. Derived from maximum likelihood of Bernoulli distribution.

Objective: Find weights \(\hat{W} = \arg\min_{W} \mathcal{L}(W)\). No closed-form solution for neural networks — we use gradient descent.
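Both losses are one-liners. A minimal numpy sketch (the `eps` clipping in BCE is a standard numerical guard, not part of the formula):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error between predictions and targets."""
    return np.mean((y_hat - y) ** 2)

def bce(y_hat, y, eps=1e-12):
    """Binary cross-entropy; clip probabilities to avoid log(0)."""
    p = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

For a perfectly confident correct prediction BCE is near zero; for an uninformative prediction of 0.5 it equals log 2 per example.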

Gradient Descent

Update rule for each parameter matrix:

\( W \leftarrow W - \alpha \frac{\partial \mathcal{L}}{\partial W} \)
  • \(\alpha\) is the learning rate
  • The gradient \(\nabla_W \mathcal{L}\) points toward steepest increase
  • We step in the opposite direction
Non-convexity: Neural network loss surfaces have many local minima and saddle points.

Loss landscape \(\mathcal{L}(w_1, w_2)\) — dark = low loss (minimum)

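The update rule is easiest to see on a loss we can minimize by hand. A sketch on a hypothetical quadratic bowl (the specific loss, learning rate, and step count are illustrative):

```python
import numpy as np

def loss(w):
    w1, w2 = w
    return (w1 - 3) ** 2 + 10 * (w2 + 1) ** 2   # minimum at (3, -1)

def grad(w):
    w1, w2 = w
    return np.array([2 * (w1 - 3), 20 * (w2 + 1)])

w = np.array([0.0, 0.0])
alpha = 0.05                  # learning rate
for _ in range(500):
    w -= alpha * grad(w)      # step against the gradient
```

On this convex bowl the iterates converge to the global minimum; on a neural network's non-convex surface, the same rule only guarantees descent to a local minimum or saddle region.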

Backpropagation

The chain rule applied layer by layer, reusing intermediate gradients:

\( \frac{\partial \mathcal{L}}{\partial W^{[k]}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \mathbf{z}^{[L]}} \cdot \frac{\partial \mathbf{z}^{[L]}}{\partial \mathbf{v}^{[L-1]}} \cdots \frac{\partial \mathbf{z}^{[k]}}{\partial W^{[k]}} \)

Forward Pass

\(\mathbf{z}^{[l]} = W^{[l]}\mathbf{v}^{[l-1]} + \mathbf{b}^{[l]}\)

\(\mathbf{v}^{[l]} = f(\mathbf{z}^{[l]})\)

Compute and store each layer's activations.

Backward Pass

Start: \(\delta^{[L]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[L]}}\)

Propagate: \(\delta^{[l]} = (W^{[l+1]})^\top \delta^{[l+1]} \odot f'(\mathbf{z}^{[l]})\)

Gradients: \(\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]}(\mathbf{v}^{[l-1]})^\top\)

Backpropagation computes all gradients in one backward sweep, with cost proportional to the forward pass.
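The forward and backward passes above translate directly into code. A sketch assuming ReLU activations at every layer (including the output, as in the single-hidden-layer formula earlier) and a squared-error loss:

```python
import numpy as np

def relu(z):  return np.maximum(0.0, z)
def drelu(z): return (z > 0).astype(float)

def forward(x, Ws, bs):
    """Forward pass: store every z^[l] and v^[l] for reuse in the backward pass."""
    zs, vs = [], [x]
    for W, b in zip(Ws, bs):
        z = W @ vs[-1] + b
        zs.append(z)
        vs.append(relu(z))
    return zs, vs

def backward(y, zs, vs, Ws):
    """Backward pass: propagate delta^[l] and collect dL/dW^[l], dL/db^[l]."""
    L = len(Ws)
    delta = 2 * (vs[-1] - y) * drelu(zs[-1])          # delta^[L] for squared error
    gWs, gbs = [None] * L, [None] * L
    for l in reversed(range(L)):
        gWs[l] = np.outer(delta, vs[l])               # dL/dW^[l] = delta^[l] (v^[l-1])^T
        gbs[l] = delta
        if l > 0:
            delta = (Ws[l].T @ delta) * drelu(zs[l - 1])
    return gWs, gbs

# Tiny 2 -> 3 -> 1 network with random weights, for illustration
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
zs, vs = forward(np.array([0.5, -1.2]), Ws, bs)
gWs, gbs = backward(np.array([0.7]), zs, vs, Ws)
```

A standard sanity check is to compare these gradients against finite differences of the loss; they should agree to several decimal places.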

Backpropagation: Concrete Example

A 1-hidden-layer network: \(x \xrightarrow{w_1} v = f(w_1 x + b_1) \xrightarrow{w_2} \hat{y} = w_2 v + b_2\). Let \(f = \text{ReLU}\), \(x = 2\), \(y = 1\).

Forward Pass

\(z_1 = w_1 x + b_1 = 0.5 \cdot 2 + 0.1 = 1.1\)

\(v = \text{ReLU}(1.1) = 1.1\)

\(\hat{y} = w_2 v + b_2 = 0.8 \cdot 1.1 + 0 = 0.88\)

\(\mathcal{L} = (\hat{y} - y)^2 = (0.88 - 1)^2 = 0.0144\)

Backward Pass

\(\frac{\partial \mathcal{L}}{\partial \hat{y}} = 2(\hat{y} - y) = -0.24\)

\(\frac{\partial \mathcal{L}}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot v = -0.24 \cdot 1.1 = -0.264\)

\(\frac{\partial \mathcal{L}}{\partial v} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot w_2 = -0.24 \cdot 0.8 = -0.192\)

\(\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial v} \cdot f'(z_1) \cdot x = -0.192 \cdot 1 \cdot 2 = -0.384\)

SGD update (α = 0.1): \(w_1 \leftarrow 0.5 - 0.1(-0.384) = 0.538\), \(w_2 \leftarrow 0.8 - 0.1(-0.264) = 0.826\). Each step moves weights to reduce loss.
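The worked numbers above can be reproduced in a few lines of plain Python:

```python
f_prime = lambda z: 1.0 if z > 0 else 0.0   # ReLU derivative

w1, b1, w2, b2 = 0.5, 0.1, 0.8, 0.0
x, y = 2.0, 1.0

# Forward pass
z1 = w1 * x + b1          # 1.1
v = max(0.0, z1)          # ReLU(1.1) = 1.1
y_hat = w2 * v + b2       # 0.88
loss = (y_hat - y) ** 2   # 0.0144

# Backward pass (chain rule)
dL_dyhat = 2 * (y_hat - y)           # -0.24
dL_dw2 = dL_dyhat * v                # -0.264
dL_dv = dL_dyhat * w2                # -0.192
dL_dw1 = dL_dv * f_prime(z1) * x     # -0.384

# SGD step, alpha = 0.1
w1 -= 0.1 * dL_dw1                   # 0.5384
w2 -= 0.1 * dL_dw2                   # 0.8264
```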

Stochastic Gradient Descent and Variants

Method      Update Rule
SGD         \(W \leftarrow W - \alpha \nabla_W \mathcal{L}_{\text{batch}}\)
Momentum    \(v \leftarrow \beta v + \nabla \mathcal{L}\), \(W \leftarrow W - \alpha v\)
Adam        Adaptive learning rates per parameter
Mini-batch SGD: Instead of computing the gradient over the entire dataset, use a random subset (batch) at each step. Faster, and the noise helps escape local minima.
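The momentum row of the table can be sketched on the same kind of hypothetical quadratic used for plain gradient descent (loss, rates, and step count are illustrative choices):

```python
import numpy as np

def grad(w):
    # Gradient of the hypothetical quadratic L(w) = (w1 - 3)^2 + 10 (w2 + 1)^2
    return np.array([2 * (w[0] - 3), 20 * (w[1] + 1)])

w = np.array([0.0, 0.0])
v = np.zeros(2)
alpha, beta = 0.01, 0.9
for _ in range(1000):
    v = beta * v + grad(w)   # v <- beta v + grad L: running average of gradients
    w = w - alpha * v        # W <- W - alpha v: step along the smoothed direction
```

The velocity term damps oscillation across steep directions while accumulating speed along shallow ones, which is why momentum often converges faster than plain SGD.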

Training in Practice: PyTorch


import torch
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 15)
        self.fc3 = nn.Linear(15, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

net = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)

x_train = torch.linspace(-1, 1, 100).unsqueeze(1)   # toy data, for illustration
y_train = torch.sin(3 * x_train)

for epoch in range(5000):
    optimizer.zero_grad()    # Reset gradients
    output = net(x_train)    # Forward pass
    loss = criterion(output, y_train)
    loss.backward()          # Backpropagation
    optimizer.step()         # Update weights

Universal Approximation in Action

A single hidden layer with enough neurons can approximate any continuous function (Cybenko, 1989; Hornik et al., 1989). The network trains via the SGD loop above. Increase neurons to improve the fit:

Architecture: 1 → N → 1 (sigmoid activations)

Training: 800 epochs of SGD on 30 sample points, LR = 0.005

Overfitting and Regularization

When a model memorizes training data but fails on new data:

Regularization Strategies

  • Weight decay (L2): \(\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda\|W\|^2\)
  • Dropout: randomly zero out neurons during training
  • Early stopping: stop when validation loss starts increasing
  • Data augmentation: increase effective dataset size

# Weight decay in PyTorch
optimizer = optim.Adam(net.parameters(),
                       lr=0.01, weight_decay=1e-3)
            

Beyond the Basics

Architectures that encode structure

The Neural Network Zoo

Different architectures encode different inductive biases about the structure of data:

Architecture             Bias / Assumption                           Application
Fully Connected (MLP)    No structure assumed                        Tabular data, function approximation
Convolutional (CNN)      Spatial locality, translation invariance    Images, spatial data
Recurrent (RNN)          Sequential dependence                       Time series, language
Autoencoder              Low-dimensional latent structure            Compression, denoising
Transformer              Attention over all positions                Language, vision, multimodal

Auto-Encoders

Learn a compressed representation by training the network to reconstruct its own input. The bottleneck forces the network to discover the most important features.

\( \mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2, \quad \hat{\mathbf{x}} = g_\theta(\underbrace{f_\phi(\mathbf{x})}_{\mathbf{z}}) \)
  • Encoder: \(\mathbf{z} = f_\phi(\mathbf{x}) \in \mathbb{R}^d\) — compressed representation
  • Decoder: \(\hat{\mathbf{x}} = g_\theta(\mathbf{z}) \in \mathbb{R}^n\) — reconstruction
  • If \(f, g\) are linear, the AE learns the SVD: same subspace as PCA
Example: MNIST denoising. Train on corrupted digits, reconstruct clean ones. The 2D latent space \(\mathbf{z} \in \mathbb{R}^2\) clusters digits by identity — the network discovers digit features without labels.
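The linear case in the last bullet can be checked directly: the optimal linear encoder/decoder pair is given in closed form by the SVD, and the reconstruction error equals the energy in the discarded singular values. A sketch on hypothetical low-rank data (shapes and latent dimension are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 8))   # 100 samples in R^8, rank 5
Xc = X - X.mean(axis=0)                                           # center, as PCA assumes

d = 3                                   # latent (bottleneck) dimension
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Optimal linear autoencoder: project onto the top-d principal directions
encode = lambda x: x @ Vt[:d].T         # z = f(x) in R^d
decode = lambda z: z @ Vt[:d]           # x_hat = g(z) back in R^8
X_hat = decode(encode(Xc))

recon_err = np.sum((Xc - X_hat) ** 2)   # equals sum of discarded singular values squared
```

A linear AE trained by gradient descent converges to the same subspace; the nonlinear AEs used in practice generalize this to curved low-dimensional structure.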

Convolutional Neural Networks

Instead of connecting every input to every hidden unit, CNNs use local filters (kernels) that slide across the input. A 2D convolution:

\( (\mathbf{K} * \mathbf{X})_{ij} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{mn} \cdot X_{i+m,\, j+n} \)
  • Convolution: slide a \(k \times k\) kernel across the image. Same kernel everywhere → weight sharing
  • Pooling (max or average): reduce spatial dimensions, e.g. \(2 \times 2\) max-pool halves width and height
  • Stack: Conv → ReLU → Pool → Conv → ReLU → Pool → Flatten → FC
Parameter savings: A 5×5 kernel on a 28×28 image has 25 weights, vs. 784×hidden for a fully connected layer. Translation invariance is built in.
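The sum in the formula above translates into two nested loops. A minimal "valid" (no padding) sketch, with a simple hypothetical edge-detecting kernel as the usage example:

```python
import numpy as np

def conv2d(X, K):
    """Slide a k x k kernel K over image X, as in the formula above."""
    k = K.shape[0]
    H, W = X.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K * X[i:i + k, j:j + k])   # same kernel at every position
    return out

# Hypothetical example: a vertical-edge kernel on a step image
X = np.zeros((4, 4)); X[:, 2:] = 1.0
K = np.array([[1.0, -1.0], [1.0, -1.0]])
out = conv2d(X, K)
```

The response is nonzero only where the kernel straddles the step, illustrating how one shared kernel detects the same feature everywhere in the image. (As is standard in deep learning, the formula is strictly a cross-correlation; the kernel is not flipped.)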

Recurrent Neural Networks

Process sequences by maintaining a hidden state that carries information across time steps:

\( \mathbf{h}_{t+1} = f(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b}_h) \)
\( \hat{\mathbf{y}}_t = W_y \mathbf{h}_t + \mathbf{b}_y \)
  • \(\mathbf{x}_t\) — input at time \(t\); \(\mathbf{h}_t\) — hidden state (memory)
  • \(W_h, W_x, W_y\) — shared across all time steps
  • \(\hat{\mathbf{y}}_t\) — prediction at time \(t\) (e.g. next value \(\mathbf{x}_{t+1}\))
ODE connection: Euler's method \(\mathbf{y}_{n+1} = \mathbf{y}_n + h\,f(\mathbf{y}_n)\) is a recurrence with fixed "weights." RNNs learn the recurrence from data.
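Unrolling the recurrence above is a single loop over the sequence. A sketch with tanh as the hidden nonlinearity and arbitrary random weights (the dimensions, input signal, and initialization are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_y = 4, 1, 1

# Weights shared across all time steps
W_h = 0.5 * rng.standard_normal((n_h, n_h))
W_x = rng.standard_normal((n_h, n_x))
W_y = rng.standard_normal((n_y, n_h))
b_h, b_y = np.zeros(n_h), np.zeros(n_y)

def run(xs):
    """Unroll h_{t+1} = tanh(W_h h_t + W_x x_t + b_h), emitting y_t at each step."""
    h = np.zeros(n_h)                  # initial hidden state (memory)
    ys = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ np.atleast_1d(x_t) + b_h)
        ys.append(W_y @ h + b_y)
    return np.array(ys)

ys = run(np.sin(np.linspace(0, 2 * np.pi, 20)))
```

Training fits W_h, W_x, W_y by backpropagating through this unrolled loop (backpropagation through time), so that y_t predicts, e.g., the next value of the sequence.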

Summary

  • Deep learning succeeds through self-organization: simple elements adapting collectively to data
  • Neural networks are compositions of linear maps and nonlinearities: \(\hat{y} = f(W_L\cdots f(W_1\mathbf{x}))\)
  • Backpropagation computes gradients efficiently via the chain rule
  • Architecture encodes inductive bias: CNNs for spatial data, RNNs for sequences
  • Regularization prevents overfitting: weight decay, dropout, early stopping
  • The hidden layer learns features; depth enables hierarchical abstraction
The power of deep learning is not in any specific architecture, but in enabling systems to learn from data autonomously.

References & Resources