Complex Systems and Probabilistic Modeling

ML for Science - Lecture 6

Chaos, uncertainty, and the bridge from randomness to determinism

Where We Are

So far:

  • Empirical laws, linear regression
  • Differential equations (ODEs, PDEs)
  • Numerical methods: discretization, stability
  • Simulating physical systems

Today:

  • What are complex systems?
  • Chaos and predictability limits
  • Probability as a tool
  • From randomness to determinism
Key question: If we know the laws of physics, what's left to discover?

What is a Complex System?

"A system where the whole is much more than the sum of its parts."

Key properties:

  • Emergence: Global patterns arise from local interactions
  • Nonlinearity: Effects are not proportional to causes, so understanding the parts doesn't mean understanding the whole
  • Multiple scales: Different behavior at different scales
  • Sensitivity: Small changes can have large effects

Examples of Complex Systems

Physical

  • Weather & climate
  • Fluid turbulence
  • Granular materials
  • Protein folding

Biological

  • Brain / neural systems
  • Ecosystems
  • Immune system
  • Cell signaling

Social/Tech

  • Social networks
  • Financial markets
  • Internet/routing
  • Cities & traffic
Common thread: We know the rules for the parts, but can't easily predict the whole.

Why Complex Systems Matter for ML

When applying ML to science, you often encounter:

Challenges:
  • High-dimensional, noisy data
  • Multiple interacting scales
  • Chaotic dynamics
  • Missing measurements
Opportunities:
  • Statistics may be predictable
  • Patterns emerge at right scale
  • Neural networks are themselves complex systems!

Rayleigh-Bénard Convection

Fluid heated from below, cooled from above — creates convection cells

Basic mechanism of atmospheric and oceanic circulation

From 7 Equations to 3: The Saltzman-Lorenz Story

Barry Saltzman (Yale, 1961) developed a 7-equation model for convection.

He showed it to Edward Lorenz at MIT; one solution "refused to settle down."

Lorenz noticed: 4 variables quickly became tiny. Only 3 were "keeping each other going."

Lorenz's insight:

"Barry gave me the go-ahead signal, and back at MIT the next morning I put the three equations on the computer..."

"...and sure enough, there was the same lack of periodicity."

Saltzman-Lorenz Exchange, 1961

The Lorenz Equations

Saltzman's 7 equations reduced to 3:

$\dot{x} = \sigma(y - x)$

$\dot{y} = x(\rho - z) - y$

$\dot{z} = xy - \beta z$
Variables:
  • $x$ = intensity of convection
  • $y$ = horizontal temperature difference
  • $z$ = vertical temperature deviation
Parameters (chaotic regime):

$\sigma = 10$, $\rho = 28$, $\beta = 8/3$

Just 3 coupled ODEs, yet the dynamics are incredibly complex
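These three ODEs are easy to integrate numerically. A minimal sketch in pure Python with a hand-rolled RK4 step (the step size and step count are illustrative choices, not from the slides):

```python
# A hand-rolled RK4 integrator for the Lorenz system (parameters from the slide).
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def trajectory(state=(1.0, 1.0, 1.0), dt=0.01, steps=1500):
    points = [state]
    for _ in range(steps):
        state = rk4_step(lorenz, state, dt)
        points.append(state)
    return points
```

Starting from $(1, 1, 1)$ with the chaotic parameters, the trajectory stays bounded on the attractor but never settles into a periodic orbit.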

The Lorenz Attractor

Interactive demo: trajectory starting from $(x, y, z) = (1, 1, 1)$

The trajectory never repeats but stays on this strange "butterfly" shape

The Accident

Lorenz wanted to extend a simulation. Instead of starting over, he typed in values from a printout:

Printout showed:

0.506
Computer stored:

0.506127

A difference of about $10^{-4}$; surely that can't matter?

He goes for coffee, comes back, and...

The Weather is Completely Different!

The two simulations start nearly identical, then completely diverge.

Wait... what's happening here?
The equations are deterministic. Same input should give same output, right?

Sensitivity to Initial Conditions

Two trajectories: the original, and one perturbed by $\varepsilon = 10^{-4}$

No matter how small $\varepsilon$ is, the trajectories eventually diverge completely.
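This divergence is easy to reproduce. A sketch using simple forward-Euler integration (step size and integration time are illustrative assumptions): two runs whose initial $x$ differs by $10^{-4}$ end up on completely different parts of the attractor.

```python
import math

# Forward-Euler sketch: two Lorenz trajectories whose initial x differs by 1e-4.
def euler_step(s, dt=0.002, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = s
    return (x + dt * sigma * (y - x),
            y + dt * (x * (rho - z) - y),
            z + dt * (x * y - beta * z))

a = (1.0, 1.0, 1.0)
b = (1.0 + 1e-4, 1.0, 1.0)        # perturb x by epsilon = 1e-4
separation = []
for _ in range(12500):            # integrate to t = 25
    a, b = euler_step(a), euler_step(b)
    separation.append(math.dist(a, b))
# Early on the separation is tiny; eventually it is of the attractor's own size.
```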

The Butterfly Effect

"Does the flap of a butterfly's wings in Brazil set off a tornado in Texas?"
- Edward Lorenz, 1972

Lorenz's discovery:

  • Deterministic $\neq$ Predictable
  • Tiny errors grow exponentially fast
  • Long-term weather prediction has a fundamental limit (~2 weeks)
This is chaos: sensitivity to initial conditions in deterministic systems.

Ensemble of Initial Conditions

What happens to a distribution of initial conditions in the Lorenz system?


A tight cluster of initial conditions spreads across the entire attractor

From Trajectories to Distribution

100 trajectories starting within $10^{-6}$ of each other — sample at time $t$ to get a distribution


The histogram shows a Probability Mass Function (PMF) — counts in discrete bins

From PMF to PDF

What if $\Delta x \to 0$?

As we use more bins (smaller $\Delta x$), the histogram approaches a smooth curve:

In the limit $\Delta x \to 0$, the PMF becomes a Probability Density Function (PDF)

Estimating PDF from Samples

Click to add samples — each gets a Gaussian "kernel", and they sum to form the estimate

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$   where   $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$ (Gaussian kernel)
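The estimator above fits in a few lines (the bandwidth $h = 0.3$ is an arbitrary illustrative choice):

```python
import math

# Kernel density estimate with a Gaussian kernel, as in the formula above:
# f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h)
def kde(samples, h=0.3):
    n = len(samples)
    norm = n * h * math.sqrt(2.0 * math.pi)
    def f_hat(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples) / norm
    return f_hat
```

Each sample contributes one Gaussian bump of width $h$; the sum is normalized so the estimate integrates to 1 and is therefore a valid PDF.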

Probability Review

(Optional background material)

Key concepts we'll use throughout the course:

  • Random variables, PMF, PDF
  • Mean, variance, expectation
  • Joint & conditional probability
  • Independence & Bayes' rule
  • Central Limit Theorem

Random Variables

A random variable $X$ maps outcomes to numbers:

$X: \Omega \to \mathbb{R}$   (sample space to real numbers)
Example: Coin flip
$\Omega = \{\text{Heads}, \text{Tails}\}$
$X(\text{Heads}) = 1$
$X(\text{Tails}) = 0$
Example: Temperature
$\Omega = $ all possible states
$X = $ temperature reading
$X \in \mathbb{R}$ (continuous)

Probability Mass Function (PMF)

For discrete random variables:

$P(X = x)$ = probability that $X$ takes value $x$
Example: a fair die has $P(X = x) = 1/6$ for each $x \in \{1, 2, 3, 4, 5, 6\}$
Properties:
  • $P(X = x) \geq 0$
  • $\sum_x P(X = x) = 1$

Probability Density Function (PDF)

For continuous random variables, the PDF $f(x)$ gives probability via integration:

$P(a \leq X \leq b) = \int_a^b f(x) \, dx$ = shaded area
Key points:

  • $f(x)$ is a density, not a probability
  • $f(x) \geq 0$ and $\int_{-\infty}^{\infty} f(x) \, dx = 1$
  • $P(X = x) = 0$ for any single point; only areas under the curve are probabilities

Mean and Variance

Mean (Expected Value):

$\mu = E[X] = \int x \cdot f(x) \, dx$

The "center of mass" of the distribution

Variance:

$\sigma^2 = E[(X - \mu)^2]$

How spread out the distribution is

Why this matters: Instead of predicting a single value, we can predict the mean AND quantify our uncertainty (variance).

Joint & Conditional Probability

Joint Probability:

$P(X, Y)$ = probability of both $X$ and $Y$
Conditional Probability:

$P(X | Y) = \frac{P(X, Y)}{P(Y)}$
Example:
$X$ = it rains, $Y$ = cloudy

$P(\text{rain} | \text{cloudy})$ = probability of rain given it's cloudy

Usually $P(\text{rain} | \text{cloudy}) > P(\text{rain})$
Marginalization: $P(X) = \sum_y P(X, Y=y) = \sum_y P(X|Y=y) P(Y=y)$

Independence

Two random variables are independent if knowing one tells you nothing about the other:

$\begin{aligned} X \perp Y \;&\iff\; P(X, Y) = P(X) \cdot P(Y) \\ &\iff\; P(X|Y) = P(X) \end{aligned}$
Independent:
  • Coin flip 1 and coin flip 2
  • Weather in Tokyo and weather in Paris
NOT Independent:
  • Temperature and ice cream sales
  • Parent height and child height

Independence is a strong assumption — rarely true in real data!

Bayes' Rule

Inverting conditional probabilities:

$P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)}$
$P(A)$ — Prior
What we believed before seeing data

$P(B|A)$ — Likelihood
How likely is the data given our hypothesis?
$P(A|B)$ — Posterior
Updated belief after seeing data

$P(B)$ — Evidence
Normalizing constant
Key idea: Bayes' rule tells us how to update beliefs with new evidence. Foundation of Bayesian inference and many ML algorithms.
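A worked example, reusing the rain/cloudy variables from earlier (all the numbers below are hypothetical, invented for illustration):

```python
# Hypothetical numbers: A = rain, B = cloudy.
p_rain = 0.2                  # prior P(A)
p_cloudy_given_rain = 0.9     # likelihood P(B|A)
p_cloudy_given_dry = 0.4      # P(B|not A)

# Evidence via marginalization: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_cloudy = p_cloudy_given_rain * p_rain + p_cloudy_given_dry * (1 - p_rain)

# Bayes' rule: posterior P(A|B) = P(B|A) P(A) / P(B)
p_rain_given_cloudy = p_cloudy_given_rain * p_rain / p_cloudy
# ~ 0.36: seeing clouds raises the belief in rain from 0.20 to 0.36
```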

Video: 3Blue1Brown, "Bayes theorem, the geometry of changing beliefs"

Central Limit Theorem

One of the most important results in probability:

The sum (or average) of many independent random variables tends toward a Gaussian distribution, regardless of the original distribution.
If $X_1, X_2, \ldots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then:

$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{d} \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$ as $n \to \infty$
Why it matters:
  • Explains why Gaussians are everywhere
  • Justifies normal approximations
  • Foundation of statistical inference
Examples:
  • Measurement errors
  • Heights of people
  • Stock price changes (approx)
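A quick numerical check (sample sizes are illustrative): averaging $n = 50$ uniform draws per trial gives means clustered around $\mu = 0.5$ with spread $\sigma/\sqrt{n} = \sqrt{1/(12 \cdot 50)} \approx 0.041$, even though the underlying draws are uniform, not Gaussian.

```python
import random
import statistics

random.seed(0)
n, trials = 50, 2000
# Each trial averages n uniform(0, 1) draws; a uniform has mu = 1/2 and
# sigma^2 = 1/12, so the CLT predicts the means are roughly N(0.5, 1/(12 n)).
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]
```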

Brownian Motion

Robert Brown (1827):

Observed pollen grains moving erratically in water under a microscope.

Albert Einstein (1905):

Explained it as evidence for atoms! Tiny molecules randomly bump the particle.

Imagine being pushed randomly by a crowd: you end up doing a "random walk."

Random Walk

At each step, move randomly — left: trajectories, right: density estimate (KDE)


Deriving the Diffusion Equation

Consider a 1D random walk on a grid with spacing $\Delta x$ and time step $\Delta t$:

Diagram: a particle at $x - \Delta x$ or $x + \Delta x$ hops to $x$, each with probability $\tfrac{1}{2}$
Master Equation:   $p(x, t + \Delta t) = \frac{1}{2} p(x - \Delta x, t) + \frac{1}{2} p(x + \Delta x, t)$
Probability at $x$ comes from particles jumping in from both neighbors

The Master Equation

Probability at position $x$ at time $t + \Delta t$ comes from neighbors:

$p(x, t + \Delta t) = \frac{1}{2} p(x - \Delta x, t) + \frac{1}{2} p(x + \Delta x, t)$

Subtracting $p(x, t)$ on both sides:

$\underbrace{p(x, t + \Delta t) - p(x, t)}_{\text{change in time}} = \tfrac{1}{2} \underbrace{\left[ p(x + \Delta x, t) - 2p(x, t) + p(x - \Delta x, t) \right]}_{\text{curvature in space}}$

Taking the Continuum Limit

Taylor expand for small $\Delta x$ and $\Delta t$:

$p(x, t + \Delta t) \approx p + \frac{\partial p}{\partial t} \Delta t$
$p(x \pm \Delta x, t) \approx p \pm \frac{\partial p}{\partial x} \Delta x + \frac{1}{2}\frac{\partial^2 p}{\partial x^2} (\Delta x)^2$

Substituting and simplifying:

$\frac{\partial p}{\partial t} \Delta t = \frac{1}{2} \cdot \frac{\partial^2 p}{\partial x^2} (\Delta x)^2$
The Diffusion Equation:   $\displaystyle\frac{\partial p}{\partial t} = D \frac{\partial^2 p}{\partial x^2}$   where   $D = \frac{(\Delta x)^2}{2 \Delta t}$
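A sanity check of this result for the simple $\pm\Delta x$ walk (walker and step counts are illustrative): the empirical variance of walker positions should grow as $2Dt$.

```python
import random
import statistics

random.seed(1)
dx, dt, steps, walkers = 1.0, 1.0, 300, 2000
D = dx ** 2 / (2 * dt)                   # D = (dx)^2 / (2 dt)

final_positions = []
for _ in range(walkers):
    x = 0.0
    for _ in range(steps):
        x += random.choice((-dx, dx))    # hop left or right with prob 1/2
    final_positions.append(x)

t = steps * dt
# The diffusion equation predicts Var[X(t)] = 2 D t for a point release at 0.
empirical_var = statistics.pvariance(final_positions)
```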

What if $\Delta x \to 0$? Random $dx$ at every $dt$?

The same physics can be written as a Stochastic Differential Equation:

PDE (density)

$\frac{\partial p}{\partial t} = D \frac{\partial^2 p}{\partial x^2}$

Evolution of probability density

SDE (trajectory)

$dX = \sqrt{2D}\, dW$

Langevin equation for single particle

$dW$ = Wiener process increment (Gaussian noise with $\langle dW \rangle = 0$, $\langle dW^2 \rangle = dt$)
Key insight: The PDE describes the ensemble, the SDE describes individual realizations. Both are equivalent!
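The SDE side can be simulated with the Euler-Maruyama scheme, the stochastic analogue of forward Euler (all numerical parameters here are illustrative):

```python
import random
import statistics

def euler_maruyama(D=0.5, dt=1e-3, steps=1000, x0=0.0):
    """One sample path of dX = sqrt(2D) dW (Euler-Maruyama scheme)."""
    x = x0
    path = [x]
    for _ in range(steps):
        dW = random.gauss(0.0, dt ** 0.5)   # <dW> = 0, <dW^2> = dt
        x += (2.0 * D) ** 0.5 * dW
        path.append(x)
    return path

random.seed(2)
endpoints = [euler_maruyama()[-1] for _ in range(2000)]
# Ensemble variance at t = 1 should match the PDE's spreading: 2 D t = 1.0
```

Each path is erratic, but the ensemble of endpoints reproduces the deterministic Gaussian spreading of the diffusion equation.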

From Randomness to Determinism

Microscopic

Individual particles walk randomly
Completely unpredictable!
Macroscopic

Density evolves deterministically
Perfectly predictable!

This is how we go from randomness at small scales to determinism at large scales.

The Right Scale Matters

A profound lesson for modeling:

If something seems unpredictable, maybe you're looking at the wrong scale.
Individual Scale
  • Molecules → chaotic
  • Neuron spikes → noisy
  • Individual trades → random
Statistical Scale
  • Temperature/pressure → smooth
  • Population activity → structured
  • Market trends → regularities

Complex Systems on Networks

Many complex systems have a network structure:

Internet
Routers, packets
Congestion, packet loss
Social
People, connections
Information spread
Neural
Neurons, synapses
Signal propagation
Common questions: What's the probability of packet loss? How does information spread? How do signals propagate?

How About Neurons to Brains?

The FitzHugh-Nagumo model: a 2D simplification of Hodgkin-Huxley (realized as a circuit by Nagumo; it generalizes the Van der Pol oscillator)

$\dot{V} = V - \tfrac{V^3}{3} - W + I$
$\dot{W} = \varepsilon(V + a - bW)$
  • $V$: membrane potential
  • $W$: recovery variable
  • $I$: input current
  • $\varepsilon$: time-scale separation
$I$ controls the firing frequency.

Flow field shows state evolution. Cubic shape creates excitability: small push → large spike.
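A minimal sketch of the model with forward Euler and the common parameter choices $a = 0.7$, $b = 0.8$, $\varepsilon = 0.08$ (these values and the step size are assumptions, not from the slide):

```python
# Forward-Euler integration of the FitzHugh-Nagumo equations.
def fhn_step(V, W, I, dt, eps=0.08, a=0.7, b=0.8):
    dV = V - V ** 3 / 3.0 - W + I      # fast membrane potential
    dW = eps * (V + a - b * W)         # slow recovery variable
    return V + dt * dV, W + dt * dW

def simulate(I=0.5, dt=0.02, steps=10000, V=-1.0, W=-0.5):
    Vs = []
    for _ in range(steps):
        V, W = fhn_step(V, W, I, dt)
        Vs.append(V)
    return Vs
```

With $I = 0.5$ the resting state is unstable and the neuron fires repeatedly, $V$ swinging between roughly $-2$ and $+2$.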

Coupled Neurons

Neurons interact through synaptic connections:

$\displaystyle\frac{dV_k}{dt} = V_k - \frac{V_k^3}{3} - W_k + I_k + \sum_{j \in \mathcal{N}(k)} g_{jk}(V_j - V_k)$
Network diagram: six neurons $V_1, \ldots, V_6$ connected by coupling strengths $g_{jk}$ (e.g. $g_{12}$, $g_{23}$, $g_{35}$)

Coupling strength $g_{jk}$ determines influence of neuron $j$ on neuron $k$
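A two-neuron sketch of this coupling (the diffusive term $g(V_j - V_k)$ from the equation above; parameters and step size are illustrative assumptions): with $g = 0$ the two membrane potentials drift apart in phase, while strong coupling pulls them together.

```python
def coupled_fhn(I=(0.4, 0.6), g=0.5, eps=0.08, a=0.7, b=0.8,
                dt=0.02, steps=60000):
    """Two diffusively coupled FitzHugh-Nagumo neurons (forward Euler)."""
    V, W = [-1.0, 1.0], [-0.5, 0.5]
    history = []
    for _ in range(steps):
        dV = [V[k] - V[k] ** 3 / 3.0 - W[k] + I[k] + g * (V[1 - k] - V[k])
              for k in (0, 1)]
        dW = [eps * (V[k] + a - b * W[k]) for k in (0, 1)]
        V = [V[k] + dt * dV[k] for k in (0, 1)]
        W = [W[k] + dt * dW[k] for k in (0, 1)]
        history.append((V[0], V[1]))
    return history

def mean_mismatch(history):
    """Average |V1 - V2| over the second half of the run."""
    half = history[len(history) // 2:]
    return sum(abs(v1 - v2) for v1, v2 in half) / len(half)
```

With $g = 0$ the mismatch stays large (different $I_k$ means different natural frequencies); at $g = 1$ the two potentials lock together.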

Synchronization

Explore how coupling strength affects synchronization:

Try it: At $g=0$ neurons oscillate at their natural frequencies (set by $I_k$). Increase $g$ to see them synchronize.

Preview: Artificial Neural Networks

Artificial neurons are a simplification:

Real neurons:
  • Fire in time (spikes)
  • Complex ion dynamics
  • Many neurotransmitters
  • Stochastic
Artificial neurons:
  • Static input/output
  • Simple: $y = \sigma(Wx + b)$
  • Deterministic
  • But still a complex system!
Key insight: Neural networks learn because they're complex systems that can adapt to data.

Summary

  1. Complex systems: The whole is more than the sum of parts
  2. Chaos: Deterministic $\neq$ Predictable (Lorenz system)
  3. Probability: A tool for dealing with uncertainty and complexity
  4. Uncertainty propagation: Distributions evolve deterministically
  5. Scale matters: Randomness at small scales can become determinism at large scales (Brownian motion $\to$ diffusion)
  6. Neural networks: Are themselves complex systems that learn from data
Next lecture: Neural networks as function approximators