ML for Science and Engineering — Lecture 11
Joseph Bakarji
Why Does Deep Learning Work?
The success of deep learning reveals two underappreciated principles about intelligence:
Self-organization: intelligence emerges from simple elements that learn to coordinate
Data-driven adaptation: enabling a system to learn from its environment is more powerful than encoding rules
Core Insight Deep networks are collections of simple elements that collaborate to process inputs and predict outputs. Their power comes not from any single element, but from their collective adaptation to data.
Rules vs. the Ability to Learn
Rule-Based Approach
Encode expert knowledge as explicit rules
Define what to do in each situation
Brittle when the environment changes
Cannot adapt to unseen contexts
Learning-Based Approach
Give the system the ability to learn
Let it adapt to its environment autonomously
Flexible in complex, unpredictable settings
More powerful when you cannot predict every case
Do you teach your children rigid rules for every situation, or do you give them the ability to adapt to any context they might encounter?
The Intellectual Milieu of the 1940s
An extraordinary convergence of ideas set the stage for artificial intelligence:
Russell & Whitehead sought to show in Principia Mathematica (1910-1913) that mathematics can be derived from formal logic. If logic is universal, and machines can perform logic...
Von Neumann designed the stored-program computer (EDVAC, 1945), using McCulloch-Pitts neurons as the conceptual model.
Norbert Wiener founded cybernetics (1948): feedback, control, and communication in animals and machines.
The reasoning:
Logic can express any computation.
Neurons perform logical operations.
Therefore, networks of neurons can compute anything.
This launched AI as a field.
Walter Pitts and the Dream of Logical Intelligence
Walter Pitts was a self-taught, homeless teenager from Detroit who, at age 12, walked into a lecture by Bertrand Russell and found errors in Principia Mathematica. Russell invited Pitts to study with him.
Pitts met Warren McCulloch, a neurophysiologist asking: how does the brain compute? They combined propositional logic with neural anatomy to create the first formal neuron model.
Their 1943 paper showed that networks of binary threshold units could implement any logical function, founding both neural networks and digital computing theory.
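A McCulloch-Pitts unit is easy to sketch directly: binary inputs, fixed weights, a hard threshold. The weights and thresholds below are hand-chosen for illustration, not taken from the 1943 paper.

```python
# A McCulloch-Pitts unit: binary inputs, fixed weights, hard threshold.

def mp_neuron(inputs, weights, threshold):
    """Fire (1) if the weighted sum of binary inputs reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# AND: both inputs must be active
AND = lambda a, b: mp_neuron([a, b], [1, 1], threshold=2)
# OR: any active input suffices
OR = lambda a, b: mp_neuron([a, b], [1, 1], threshold=1)
# NOT: a single inhibitory (negative) weight
NOT = lambda a: mp_neuron([a], [-1], threshold=0)

# XOR cannot be computed by one unit, but a two-layer network works:
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))
```

The XOR line is the point: a single threshold unit is limited, but a *network* of them implements any logical function.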
The Macy Conferences (1946-1953)
McCulloch chaired interdisciplinary meetings with von Neumann, Wiener, Pitts, Bateson, and Margaret Mead. Here cybernetics crystallized, and the ideas that would become AI, information theory, and cognitive science were debated.
From Theory to Biology: The Squid Giant Axon
While McCulloch-Pitts modeled neurons as logical abstractions, experimentalists studied the real thing. The squid giant axon (1000x thicker than human axons) was large enough to insert electrodes inside.
1939: Cole & Curtis develop the voltage clamp: hold the membrane potential fixed, measure the current.
1952: Hodgkin & Huxley fit ODEs to voltage-clamp data, with gating variables \(m, h, n\) governed by first-order kinetics.
A triumph of scientific modeling: Hodgkin and Huxley had no computers. They solved the ODEs by hand with a desk calculator, predicting the action potential shape before it could be measured at full resolution.
\( C_m \frac{dV}{dt} = I_{\text{ext}} - g_{Na} m^3 h (V - E_{Na}) - g_K n^4 (V - E_K) - g_L (V - E_L) \)
The gating variables \(m, h, n \in [0,1]\) model ion channel opening probabilities. Na\(^+\) channels need 3 activation gates (\(m^3\)) and 1 inactivation gate (\(h\)); K\(^+\) channels need 4 gates (\(n^4\)). Each satisfies:
\( \frac{dx}{dt} = \alpha_x(V)\,(1 - x) - \beta_x(V)\,x, \qquad x \in \{m, h, n\} \)
where \(\alpha_x, \beta_x\) are voltage-dependent rate functions that H&H fit to experimental data as combinations of exponentials and rationals, e.g.:
\( \alpha_m(V) = \frac{0.1\,(V + 40)}{1 - e^{-(V+40)/10}} \)
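A minimal forward-Euler integration of the model, assuming the standard 1952 parameter values in the modern convention (rest near \(-65\) mV; units mV, ms, µA/cm², mS/cm²). The constants and rate functions below are filled in from the literature, not from this lecture:

```python
import numpy as np

# Forward-Euler integration of the Hodgkin-Huxley equations (illustrative sketch).
C_m, I_ext = 1.0, 10.0                      # membrane capacitance, applied current
g_Na, g_K, g_L = 120.0, 36.0, 0.3          # maximal conductances
E_Na, E_K, E_L = 50.0, -77.0, -54.4        # reversal potentials

def alpha_beta(V):
    """Voltage-dependent rate functions for the gates m, h, n."""
    am = 0.1 * (V + 40) / (1 - np.exp(-(V + 40) / 10))
    bm = 4.0 * np.exp(-(V + 65) / 18)
    ah = 0.07 * np.exp(-(V + 65) / 20)
    bh = 1 / (1 + np.exp(-(V + 35) / 10))
    an = 0.01 * (V + 55) / (1 - np.exp(-(V + 55) / 10))
    bn = 0.125 * np.exp(-(V + 65) / 80)
    return (am, bm), (ah, bh), (an, bn)

dt, T = 0.01, 50.0
V = -65.0
# start each gate at its steady state alpha/(alpha+beta) at rest
(am, bm), (ah, bh), (an, bn) = alpha_beta(V)
m, h, n = am / (am + bm), ah / (ah + bh), an / (an + bn)

trace = []
for _ in range(int(T / dt)):
    (am, bm), (ah, bh), (an, bn) = alpha_beta(V)
    m += dt * (am * (1 - m) - bm * m)       # dx/dt = alpha(1-x) - beta*x
    h += dt * (ah * (1 - h) - bh * h)
    n += dt * (an * (1 - n) - bn * n)
    I_Na = g_Na * m**3 * h * (V - E_Na)
    I_K = g_K * n**4 * (V - E_K)
    I_L = g_L * (V - E_L)
    V += dt / C_m * (I_ext - I_Na - I_K - I_L)
    trace.append(V)
```

With a sustained 10 µA/cm² stimulus the membrane fires action potentials that overshoot 0 mV, the behavior H&H predicted by hand.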
Modern deep learning descends from the McCulloch-Pitts tradition: simplified, computational units whose power comes from collective self-organization, not biological fidelity. The architecture matters less than the ability to adapt.
From Linear Models to Deep Networks
Building the mathematical foundations
The Linear Hypothesis
The simplest supervised learning model assumes a linear relationship between (features of) the inputs and the outputs:
\( \hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + b \)
\(\boldsymbol{\phi}(\mathbf{x}) = \;?\;?\;?\) (what if we could learn this?)
Recall: We split data into training, validation, and test sets to select features and evaluate generalization. The question remains: how do we choose \(\boldsymbol{\phi}\)?
Neural networks answer this question by learning the features directly from data.
Linear Predictor: Network View
For multiple inputs and outputs:
\( \hat{\mathbf{y}} = W\mathbf{x} + \mathbf{b} \)
In index notation:
\( \hat{y}_i = \sum_{j=1}^n W_{ij} x_j + b_i \)
where \(W \in \mathbb{R}^{m \times n}\), \(\mathbf{b} \in \mathbb{R}^m\).
Each output is a weighted sum of all inputs. Drawn as a network diagram, each edge carries one weight \(W_{ij}\), and this structure becomes visible.
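The matrix form and the index notation are the same computation; a quick NumPy check (shapes chosen arbitrarily for illustration):

```python
import numpy as np

# The same linear map written two ways: matrix form and index notation.
# Shapes follow the slide: W in R^{m x n}, b in R^m.
rng = np.random.default_rng(0)
m_out, n_in = 3, 4
W = rng.normal(size=(m_out, n_in))
b = rng.normal(size=m_out)
x = rng.normal(size=n_in)

y_matrix = W @ x + b                      # y_hat = Wx + b

y_index = np.array([                      # y_hat_i = sum_j W_ij x_j + b_i
    sum(W[i, j] * x[j] for j in range(n_in)) + b[i]
    for i in range(m_out)
])

assert np.allclose(y_matrix, y_index)
```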
Activation Functions
To introduce nonlinearity, we pass the linear output through an activation function \(f\), applied elementwise:
\( \mathbf{h} = f(W\mathbf{x} + \mathbf{b}) \)
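The common choices are simple elementwise functions; a sketch:

```python
import numpy as np

# Common activations, applied elementwise to the pre-activation z = Wx + b.
def relu(z):    return np.maximum(0.0, z)          # max(0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))    # squashes to (0, 1)
def tanh(z):    return np.tanh(z)                  # squashes to (-1, 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))     # zeroes the negative entry, keeps the positive one
print(sigmoid(z))  # symmetric around 0.5
```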
Deep networks learn features at increasing levels of abstraction, from simple edges to complex concepts.
Layer 1: edges, gradients, simple patterns
Layers 2-3: textures, parts, local combinations
Deeper layers: objects, faces, semantic concepts
Reference: Lee et al., "Unsupervised learning of hierarchical representations with convolutional deep belief networks," Communications of the ACM (2011).
SGD
\(W \leftarrow W - \alpha \nabla_W \mathcal{L}_{\text{batch}}\)
Momentum
\(v \leftarrow \beta v + \nabla \mathcal{L}\), \(W \leftarrow W - \alpha v\)
Adam
Adaptive learning rates per parameter
Mini-batch SGD: Instead of computing the gradient over the entire dataset, use a random subset (batch) at each step. Faster, and the noise helps escape local minima.
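The two update rules above can be written from scratch in a few lines. A sketch on a least-squares problem (the data and hyperparameters are made up for illustration):

```python
import numpy as np

# Mini-batch SGD with momentum on least squares, written from scratch.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(3)
v = np.zeros(3)
alpha, beta, batch = 0.05, 0.9, 32

for step in range(300):
    idx = rng.choice(len(X), size=batch, replace=False)   # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch * Xb.T @ (Xb @ w - yb)               # gradient of MSE on the batch
    v = beta * v + grad                                   # momentum: v <- beta*v + grad
    w = w - alpha * v                                     # W <- W - alpha*v
```

Setting `beta = 0` recovers plain mini-batch SGD; the momentum term smooths the noisy batch gradients.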
Training in Practice: PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 15)
        self.fc3 = nn.Linear(15, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Example training data: noisy samples of a 1D function
x_train = torch.linspace(-1, 1, 100).unsqueeze(1)
y_train = torch.sin(3 * x_train)

net = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)

for epoch in range(5000):
    output = net(x_train)                # Forward pass
    loss = criterion(output, y_train)
    loss.backward()                      # Backpropagation
    optimizer.step()                     # Update weights
    optimizer.zero_grad()                # Reset gradients
Universal Approximation in Action
A single hidden layer with enough neurons can approximate any continuous function (Cybenko, 1989; Hornik et al., 1989). The network trains via the SGD loop above. Increase neurons to improve the fit:
Architecture: 1 → N → 1 (sigmoid activations)
Training: 800 epochs of SGD on 30 sample points, LR = 0.005
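The demo can be reproduced offline with a hand-written training loop. Hyperparameters below are illustrative and differ from the interactive settings; the backward pass is just the chain rule written out:

```python
import numpy as np

# A 1 -> N -> 1 sigmoid network trained by full-batch gradient descent.
rng = np.random.default_rng(0)
N = 20                                   # hidden neurons
x = np.linspace(-1, 1, 30)[:, None]      # 30 sample points
y = np.sin(np.pi * x)                    # target function

W1 = rng.normal(size=(1, N)); b1 = np.zeros(N)
W2 = rng.normal(size=(N, 1)); b2 = np.zeros(1)
lr = 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

losses = []
for epoch in range(5000):
    # forward pass
    h = sigmoid(x @ W1 + b1)             # hidden activations
    y_hat = h @ W2 + b2
    err = y_hat - y
    losses.append(np.mean(err**2))
    # backward pass (chain rule by hand)
    d_yhat = 2 * err / len(x)
    dW2 = h.T @ d_yhat; db2 = d_yhat.sum(0)
    d_h = d_yhat @ W2.T * h * (1 - h)    # sigmoid'(z) = h(1-h)
    dW1 = x.T @ d_h; db1 = d_h.sum(0)
    # gradient step (in place)
    for P, dP in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        P -= lr * dP
```

Increasing `N` gives the network more sigmoid "bumps" to compose, which is exactly the mechanism behind the universal approximation theorem.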
Overfitting and Regularization
Overfitting occurs when a model memorizes the training data but fails to generalize to new data. Common remedies:
Weight decay (L2): penalize large weights
Dropout: randomly zero out neurons during training
Early stopping: stop when the validation loss starts increasing
Data augmentation: increase the effective dataset size
# Weight decay in PyTorch
optimizer = optim.Adam(net.parameters(),
                       lr=0.01, weight_decay=1e-3)
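What `weight_decay` does mechanically: it adds a \(\lambda W\) term to the gradient, shrinking the weights toward zero each step. A stripped-down sketch (with the data gradient set to zero so only the decay acts):

```python
import numpy as np

# Weight decay by hand: update W <- W - alpha * (grad + lambda * W).
rng = np.random.default_rng(0)
w = rng.normal(size=5)
w_plain = w.copy()
alpha, lam = 0.1, 1e-1
grad = np.zeros(5)                        # pretend the data gradient is zero

for _ in range(100):
    w = w - alpha * (grad + lam * w)      # decayed update
    w_plain = w_plain - alpha * grad      # no decay: weights never shrink

# with zero data gradient, decay alone drives w geometrically toward 0
```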
Beyond the Basics
Architectures that encode structure
The Neural Network Zoo
Different architectures encode different inductive biases about the structure of data:
Architecture | Bias / Assumption | Application
Fully Connected (MLP) | No structure assumed | Tabular data, function approximation
Convolutional (CNN) | Spatial locality, translation invariance | Images, spatial data
Recurrent (RNN) | Sequential dependence | Time series, language
Autoencoder | Low-dimensional latent structure | Compression, denoising
Transformer | Attention over all positions | Language, vision, multimodal
Auto-Encoders
Learn a compressed representation by training the network to reconstruct its own input. The bottleneck forces the network to discover the most important features.
If the encoder \(g\) and decoder \(f\) are linear, the autoencoder learns the truncated SVD: the same subspace as PCA.
Example: MNIST denoising. Train on corrupted digits, reconstruct clean ones. The 2D latent space \(\mathbf{z} \in \mathbb{R}^2\) clusters digits by identity — the network discovers digit features without labels.
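The linear-AE/PCA connection can be sketched directly via the SVD, without training a network: by the Eckart-Young theorem, the best rank-\(k\) linear encode/decode pair spans the top-\(k\) principal subspace. Synthetic data below stands in for a real dataset:

```python
import numpy as np

# Optimal linear autoencoder = truncated SVD = PCA subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))   # data on a 2D subspace
X += 0.01 * rng.normal(size=X.shape)                       # small noise
Xc = X - X.mean(0)                                         # center, as PCA requires

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
encode = Vt[:k].T        # "encoder": R^10 -> R^2 (top right singular vectors)
decode = Vt[:k]          # "decoder": R^2 -> R^10
Z = Xc @ encode          # latent codes
X_rec = Z @ decode       # reconstruction from the bottleneck

rel_err = np.linalg.norm(Xc - X_rec) / np.linalg.norm(Xc)
```

Because the data is essentially two-dimensional, a 2D bottleneck reconstructs it almost perfectly; a nonlinear autoencoder generalizes this to curved low-dimensional structure.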
Convolutional Neural Networks
Instead of connecting every input to every hidden unit, CNNs use local filters (kernels) that slide across the input. A 2D convolution (implemented as cross-correlation, the deep-learning convention):
\( (K * X)_{ij} = \sum_{a}\sum_{b} K_{ab}\, X_{i+a,\, j+b} \)
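A naive sketch of the operation, with a hand-chosen edge filter to show why locality is the right bias for images:

```python
import numpy as np

# Naive "valid" 2D convolution (cross-correlation, as in deep learning):
# slide a small kernel over the input and take inner products.
def conv2d(X, K):
    H, W_ = X.shape
    k1, k2 = K.shape
    out = np.zeros((H - k1 + 1, W_ - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K * X[i:i + k1, j:j + k2])
    return out

# A vertical-edge filter responds only where intensity jumps left-to-right
X = np.zeros((5, 5)); X[:, 3:] = 1.0      # dark left half, bright right half
K = np.array([[-1.0, 1.0]])
out = conv2d(X, K)
```

The same two weights are reused at every position; that weight sharing is the translation-invariance bias in the table above.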
Recurrent Neural Networks
RNNs process a sequence by updating a hidden state at each step:
\( \mathbf{h}_t = f(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t), \qquad \hat{\mathbf{y}}_t = W_y \mathbf{h}_t \)
\(\mathbf{x}_t\) — input at time \(t\); \(\mathbf{h}_t\) — hidden state (memory)
\(W_h, W_x, W_y\) — shared across all time steps
\(\hat{\mathbf{y}}_t\) — prediction at time \(t\) (e.g. next value \(\mathbf{x}_{t+1}\))
ODE connection: Euler's method \(\mathbf{y}_{n+1} = \mathbf{y}_n + h\,f(\mathbf{y}_n)\) is a recurrence with fixed "weights." RNNs learn the recurrence from data.
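The recurrence can be sketched as a plain loop with shared weights. The sizes, scaling, and toy input below are made up for illustration, and the network is untrained, so only the structure (not the predictions) is meaningful:

```python
import numpy as np

# One unrolled RNN: h_t = tanh(W_h h_{t-1} + W_x x_t), y_t = W_y h_t,
# with the SAME weights reused at every time step.
rng = np.random.default_rng(0)
d_in, d_h = 1, 8
W_h = rng.normal(size=(d_h, d_h)) * 0.1   # small scale keeps the state stable
W_x = rng.normal(size=(d_h, d_in))
W_y = rng.normal(size=(d_in, d_h))

xs = np.sin(0.3 * np.arange(20))[:, None]  # a toy input sequence
h = np.zeros(d_h)                          # initial hidden state (memory)
preds = []
for x_t in xs:
    h = np.tanh(W_h @ h + W_x @ x_t)       # shared-weight state update
    preds.append(W_y @ h)                  # y_t, e.g. a guess at x_{t+1}
```

Compare with Euler's method: replace `tanh(W_h h + W_x x)` by `h + dt * f(h)` and the loop is a fixed-weight ODE solver; training an RNN learns the recurrence instead of prescribing it.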
Summary
Deep learning succeeds through self-organization: simple elements adapting collectively to data
Neural networks are compositions of linear maps and nonlinearities: \(\hat{y} = f(W_L\cdots f(W_1\mathbf{x}))\)
Backpropagation computes gradients efficiently via the chain rule
Architecture encodes inductive bias: CNNs for spatial data, RNNs for sequences
Regularization prevents overfitting: weight decay, dropout, early stopping
The hidden layer learns features; depth enables hierarchical abstraction
The power of deep learning is not in any specific architecture, but in enabling systems to learn from data autonomously.
References & Resources
McCulloch & Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity" (1943)
Hodgkin & Huxley, "A quantitative description of membrane current..." J. Physiol. (1952)
Goodfellow, Bengio, Courville. Deep Learning. MIT Press (2016).