Learning Differential Equations
from Data

ML for Science and Engineering — Lecture 8

Where We Are

So far we have modeled time series by mapping previous steps to next steps:

$x_{i+1} = f(x_i, x_{i-1}, \dots)$

Autoregressive models, ARIMA, function fitting on lagged variables.

Today: a fundamentally different approach. Model the derivative:

$\dot{x} = f(x)$

Discover the differential equation that generates the data.

Why Model Derivatives?

This is what Newton did. This is what everyone in the sciences has done for quite some time.
A differential equation $\dot{x} = f(x)$ says that the same rule $f$ governs the system at every moment in time. It generalizes in time.
The dual relationship: In traditional physics, $f$ is known and $x$ is unknown (forward problem). Here, $x$ is known and $f$ is unknown (inverse problem).
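Both directions can be made concrete in code. A minimal sketch of the forward problem, using explicit Euler integration on a hypothetical decay rule (the rule and step sizes are illustrative, not from the lecture):

```python
import numpy as np

def euler_integrate(f, x0, dt, n_steps):
    """Forward problem: given the rule f, generate the trajectory x(t)."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))  # explicit Euler step
    return np.array(xs)

# Hypothetical rule: exponential decay, x' = -0.5 x
traj = euler_integrate(lambda x: -0.5 * x, x0=1.0, dt=0.01, n_steps=1000)
```

The inverse problem of this lecture runs the other way: given `traj`, recover the rule (here, the coefficient -0.5).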

A Brief History

Discovering equations from data is not a new idea; scientists have long fitted model equations to observations by hand.

The key idea here: cast equation discovery as sparse linear regression over a library of candidate terms.

Case Study: Predator-Prey Dynamics

Hudson's Bay Company data (1900-1920): hare and lynx pelts traded.

[Figure: pelt counts over time. Hare (prey, $x$); lynx (predator, $y$).]
Clear oscillations, two coupled variables. Can we discover the governing equations $\dot{x} = f(x,y)$, $\dot{y} = g(x,y)$ from the data alone?

Proposing a Hypothesis Class

We have two variables $x$ (prey) and $y$ (predator). Assume the dynamics can be written as a linear combination of candidate terms:

$\dot{x} = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 x^2 + \theta_4 y^2 + \theta_5 xy + \theta_6 \sin(x) + \cdots$

We don't know which terms matter. We include many candidates and let the algorithm select the active ones.

Sneak peek: the true equations turn out to be $\dot{x} = \alpha x - \beta xy$ and $\dot{y} = \delta xy - \gamma y$. Only a few terms survive out of many candidates!
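Before touching the real pelt data, one can generate a synthetic trajectory from these equations. A sketch with illustrative parameter values (assumed, not fitted to the Hudson's Bay data) and simple Euler stepping:

```python
import numpy as np

# Illustrative Lotka-Volterra parameters (assumed, not fitted to the pelt data)
alpha, beta, delta, gamma = 1.0, 0.1, 0.075, 1.5

def lotka_volterra(state):
    x, y = state
    return np.array([alpha * x - beta * x * y,    # prey:     x' = ax - bxy
                     delta * x * y - gamma * y])  # predator: y' = dxy - gy

# Generate a synthetic trajectory with small Euler steps
dt, n = 0.001, 20000
X = np.empty((n, 2))
X[0] = [10.0, 5.0]
for i in range(n - 1):
    X[i + 1] = X[i] + dt * lotka_volterra(X[i])
```

The resulting `X` plays the role of the measured pelt counts: oscillating, coupled, and ready for the discovery pipeline below.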

The Feature Library

The library encodes our hypothesis class: which terms could appear in the equation?

| Library choice | What it can discover |
|---|---|
| $[1, x, x^2, x^3]$ | Polynomial ODEs |
| $[x, \sin(x), \cos(x)]$ | Trigonometric dynamics (e.g. pendulum) |
| $[x, y, xy, x^2, y^2]$ | Coupled 2D systems (e.g. Lotka-Volterra) |
| $[u, u_x, u_{xx}, u \cdot u_x]$ | PDEs (diffusion, Burgers, etc.) |
More features = more expressive, but harder to identify the right terms. This is where sparsity comes in.
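One way to make a library concrete is as a list of named candidate functions, evaluated column by column. A minimal sketch matching the coupled-2D row of the table (the names and helper are mine):

```python
import numpy as np

# A small polynomial library for a 2-D system.
# Each entry pairs a printable name with a callable on the state columns.
library = [
    ("1",   lambda x, y: np.ones_like(x)),
    ("x",   lambda x, y: x),
    ("y",   lambda x, y: y),
    ("xy",  lambda x, y: x * y),
    ("x^2", lambda x, y: x**2),
    ("y^2", lambda x, y: y**2),
]

def build_Phi(x, y):
    """Evaluate every candidate term at every sample: shape (n, p)."""
    return np.column_stack([f(x, y) for _, f in library])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.5])
Phi = build_Phi(x, y)   # shape (3, 6)
```

Keeping the names alongside the callables makes it easy to print the discovered equation at the end.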

Approximating $\dot{x}$ from Data

We need $\dot{x}$, but all we have are discrete samples $x(t_1), x(t_2), \dots, x(t_n)$.

Forward difference:

$\dot{x}(t_i) \approx \frac{x(t_{i+1}) - x(t_i)}{\Delta t}$

Central difference:

$\dot{x}(t_i) \approx \frac{x(t_{i+1}) - x(t_{i-1})}{2\Delta t}$
[Figure: true derivative vs. forward and central difference estimates, $\Delta t = 0.05$.]
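The accuracy gap between forward and central differences is easy to check numerically on a signal with a known derivative; the step size below matches the lecture's $\Delta t = 0.05$:

```python
import numpy as np

# Compare forward vs central differences on a signal with known derivative.
dt = 0.05
t = np.arange(0, 2 * np.pi, dt)
x = np.sin(t)                    # true derivative is cos(t)

fwd = (x[1:] - x[:-1]) / dt                 # O(dt)   at points t[:-1]
cen = (x[2:] - x[:-2]) / (2 * dt)           # O(dt^2) at points t[1:-1]

err_fwd = np.max(np.abs(fwd - np.cos(t[:-1])))
err_cen = np.max(np.abs(cen - np.cos(t[1:-1])))
# err_cen is much smaller than err_fwd at this step size
```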

Differentiation Methods

| Method | Accuracy | Noise sensitivity | Notes |
|---|---|---|---|
| Forward difference | $O(\Delta t)$ | High | Simplest |
| Central difference | $O(\Delta t^2)$ | High | Better accuracy |
| Savitzky-Golay | Depends on window | Low | Smoothing + differentiation |
| Fourier / spectral | Spectral | Medium | Global; periodic signals |
| Total variation | Regularized | Low | For very noisy data |
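The Savitzky-Golay row can be tried directly, assuming SciPy is available; the window length and polynomial order below are illustrative tuning knobs, not recommended values:

```python
import numpy as np
from scipy.signal import savgol_filter

dt = 0.05
t = np.arange(0, 2 * np.pi, dt)
rng = np.random.default_rng(0)
x = np.sin(t) + 0.01 * rng.standard_normal(t.size)   # noisy samples of sin(t)

# Savitzky-Golay: fit a cubic in a 15-sample sliding window, then
# differentiate the fit instead of the raw data.
dxdt_sg = savgol_filter(x, window_length=15, polyorder=3, deriv=1, delta=dt)
dxdt_cd = np.gradient(x, dt)             # plain central differences

interior = slice(10, -10)                # compare away from the boundaries
err_sg = np.max(np.abs(dxdt_sg - np.cos(t))[interior])
err_cd = np.max(np.abs(dxdt_cd - np.cos(t))[interior])
```

Central differences amplify the measurement noise by a factor of roughly $1/\Delta t$, while the local polynomial fit averages it away.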

From Data Points to Equations

Focus on $x$ with a quadratic hypothesis: $\dot{x}(t_i) = \theta_0 + \theta_1 \, x(t_i) + \theta_2 \, x(t_i)^2$

Each data point gives one equation in the same unknowns $\theta_0, \theta_1, \theta_2$.


The System in Matrix Form

Stacking all $n$ equations from the data points:

$\underbrace{\begin{bmatrix} \dot{x}(t_1) \\ \dot{x}(t_2) \\ \dot{x}(t_3) \\ \vdots \\ \dot{x}(t_n) \end{bmatrix}}_{\dot{\mathbf{X}}} = \underbrace{\begin{bmatrix} 1 & x(t_1) & x(t_1)^2 \\ 1 & x(t_2) & x(t_2)^2 \\ 1 & x(t_3) & x(t_3)^2 \\ \vdots & \vdots & \vdots \\ 1 & x(t_n) & x(t_n)^2 \end{bmatrix}}_{\boldsymbol{\Phi}} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}}_{\boldsymbol{\theta}}$
$n$ equations, $3$ unknowns. When $n \gg 3$: overdetermined. This is just linear regression: $\dot{\mathbf{X}} = \boldsymbol{\Phi}\,\boldsymbol{\theta}$.

The General Feature Matrix $\boldsymbol{\Phi}\big(\mathbf{x}(t)\big)$

For an arbitrary library $\phi_0, \phi_1, \dots, \phi_m$:

$\underbrace{\begin{bmatrix} \dot{x}(t_1) \\ \dot{x}(t_2) \\ \vdots \\ \dot{x}(t_n) \end{bmatrix}}_{\dot{\mathbf{X}}} = \underbrace{\begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_m(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_m(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_m(x_n) \end{bmatrix}}_{\boldsymbol{\Phi}\big(\mathbf{x}(t)\big)} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_m \end{bmatrix}}_{\boldsymbol{\theta}}$

Compact form: $\dot{\mathbf{X}} = \boldsymbol{\Phi}\big(\mathbf{x}(t)\big) \, \boldsymbol{\theta}$. Each $\phi_j$ could be $x$, $x^2$, $\sin(x)$, $xy$, etc.

The Least Squares Solution

Minimize the squared error between predicted and observed derivatives:

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \left\| \boldsymbol{\Phi} \, \boldsymbol{\theta} - \dot{\mathbf{X}} \right\|_2^2$

Closed-form solution via the normal equations:

$\boldsymbol{\theta}^* = \left(\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^\top \dot{\mathbf{X}}$
This gives the best fit using all features. But many coefficients will be small and spurious. We need sparsification.
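In code, the normal equations work, but `np.linalg.lstsq` (QR/SVD based) is the numerically safer route, since forming $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ squares the condition number. A toy check on a quadratic library (data and coefficients made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 2.0, size=200)
Phi = np.column_stack([np.ones_like(x), x, x**2])
theta_true = np.array([0.0, 2.0, -0.5])               # hypothetical true rule
dxdt = Phi @ theta_true + 0.01 * rng.standard_normal(200)

theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ dxdt)  # normal equations
theta_ls = np.linalg.lstsq(Phi, dxdt, rcond=None)[0]   # preferred in practice
```

Both recover the true coefficients here; on ill-conditioned libraries (many correlated features) the normal equations degrade first.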

Generalizing to Vector Systems

When $\mathbf{x} = (x_1, x_2, \dots, x_d)$ is a vector, each component gets its own equation but shares the same library:

$\dot{x}_k = \boldsymbol{\theta}_k^\top \boldsymbol{\phi}(\mathbf{x}), \quad k = 1, \dots, d$

Compact matrix form:

$\underbrace{\dot{\mathbf{X}}}_{\mathbb{R}^{n \times d}} = \underbrace{\boldsymbol{\Phi}(\mathbf{X})}_{\mathbb{R}^{n \times p}} \; \underbrace{\boldsymbol{\Theta}}_{\mathbb{R}^{p \times d}}$

Where $n$ = data points, $d$ = state dimension, $p$ = library size. Sparsify each column of $\boldsymbol{\Theta}$ independently.

Approaches to Sparsity

LASSO ($\ell_1$ regularization)

$\min_\theta \|\Phi\theta - \dot{X}\|_2^2 + \lambda \|\theta\|_1$

The $\ell_1$ penalty pushes small coefficients to exactly zero. We saw this with regularization.
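A quick LASSO sketch, assuming scikit-learn is available (the library, data, and `alpha` value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso   # assuming scikit-learn is available

# The l1 penalty drives inactive library terms toward zero.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=300)
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])
dxdt = 1.5 * x - 0.8 * x**3 + 0.01 * rng.standard_normal(300)  # active: x, x^3

model = Lasso(alpha=0.01, fit_intercept=False).fit(Phi, dxdt)
theta = model.coef_
# Expect theta close to [0, 1.5, 0, -0.8], with slight l1 shrinkage
# on the two active terms
```

One caveat: the $\ell_1$ penalty biases the surviving coefficients toward zero, which is why SINDy-style pipelines often refit the active terms by plain least squares afterwards.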

Sequential Thresholding (STLS)

Repeat: solve, then set $|\theta_j| < \lambda$ to zero

The approach used in the original SINDy paper. Simple, effective, interpretable.

SINDy in Python


import numpy as np

# Inputs (assumed given): X is an (n, 2) array of state measurements,
# dt the sampling interval, threshold the sparsity knob (lambda)
dXdt = (X[2:] - X[:-2]) / (2 * dt)        # derivatives (central difference)

# Feature library: [1, x, y, x^2, y^2, xy], evaluated at the interior samples
x, y = X[1:-1, 0], X[1:-1, 1]
Phi = np.column_stack([np.ones_like(x), x, y, x**2, y**2, x*y])

Theta = np.linalg.lstsq(Phi, dXdt, rcond=None)[0]   # initial least-squares fit

# Sequential thresholding: zero out small coefficients, refit the survivors
for _ in range(10):
    small = np.abs(Theta) < threshold
    Theta[small] = 0
    for j in range(2):                     # one column per state variable
        active = ~small[:, j]
        if active.any():
            Theta[active, j] = np.linalg.lstsq(
                Phi[:, active], dXdt[:, j], rcond=None)[0]

SINDy: Step by Step

1. Build the feature library: Phi = [1, x, y, x^2, y^2, xy] (n data points × 6 features).
2. Solve least squares: theta = lstsq(Phi, X_dot) (best fit using ALL features).
3. Threshold small coefficients: theta[|theta| < λ] = 0 (enforce sparsity).
4. Re-solve using only the active terms: theta[active] = lstsq(Phi[:, active], X_dot); if the active set changed, go back to step 2.

Finding the Right Solution

SINDy converges, but is the answer correct? Several factors make this harder than it looks:

1. Noise corrupts derivatives
Small errors in $x(t)$ become large errors in $\dot{x}$. Noisy derivatives bias the least-squares solution and can activate spurious terms.
2. $\lambda$ is a hyperparameter
Too large $\rightarrow$ everything is zeroed out (trivial $\dot{x}=0$).
Too small $\rightarrow$ spurious terms survive (overfitting).
How do we measure "good"?
Training error (MSE on data used to fit) always decreases with more terms. We need a metric that captures generalization.
Train / test split
Hold out the last portion of the time series. Fit on training data, evaluate on held-out test data. The best $\lambda$ minimizes test error.
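The tuning loop can be sketched as follows; `stls` is my own minimal single-variable implementation of sequential thresholding, and the data and $\lambda$ grid are illustrative:

```python
import numpy as np

def stls(Phi, dxdt, lam, n_iter=10):
    """Sequentially thresholded least squares for one target variable."""
    theta = np.linalg.lstsq(Phi, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        theta[np.abs(theta) < lam] = 0.0
        active = np.abs(theta) > 0
        if active.any():
            theta[active] = np.linalg.lstsq(Phi[:, active], dxdt,
                                            rcond=None)[0]
    return theta

# Toy sweep: fit on the head of the series, score each lambda on the tail.
rng = np.random.default_rng(3)
x = rng.uniform(0.5, 2.0, size=400)
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])
dxdt = 2.0 * x - 0.5 * x**2 + 0.05 * rng.standard_normal(400)  # active: x, x^2

split = 300                               # chronological train / test split
test_mse = {}
for lam in (0.01, 0.1, 1.0):
    theta = stls(Phi[:split], dxdt[:split], lam)
    test_mse[lam] = np.mean((Phi[split:] @ theta - dxdt[split:]) ** 2)
```

Here an overly large $\lambda$ prunes a genuinely active term and the held-out error jumps, which is exactly the signal used to pick $\lambda$.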

Hyperparameter Tuning: $\lambda$ vs MSE

[Plot: train MSE and test MSE as functions of $\lambda$; each dot is one SINDy run at that $\lambda$.]

Part 2: From ODEs to PDEs

ML for Science and Engineering — Lecture 9

From ODEs to PDEs

What if our variable depends on space and time? $u(x, t)$

$\frac{\partial u}{\partial t} = f\!\left(u, \; \frac{\partial u}{\partial x}, \; \frac{\partial^2 u}{\partial x^2}, \; u \frac{\partial u}{\partial x}, \; \dots \right)$

Same idea: the feature library now includes spatial derivatives:

$\boldsymbol{\phi} = [1, \; u, \; u_x, \; u_{xx}, \; u^2, \; u \cdot u_x, \; u_{xxx}, \; \dots]$
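Computing such a library from gridded data is straightforward with `np.gradient`. A sketch on a hypothetical traveling-pulse field (the grids and the field itself are illustrative):

```python
import numpy as np

# Build PDE library terms on a space-time grid u[i, j] = u(x_i, t_j),
# assuming uniform grid spacings dx and dt.
dx, dt = 0.1, 0.01
x = np.arange(0, 10, dx)
t = np.arange(0, 1, dt)
X, T = np.meshgrid(x, t, indexing="ij")
u = np.exp(-((X - 5 - T) ** 2))            # illustrative traveling pulse

u_t = np.gradient(u, dt, axis=1)           # time derivative
u_x = np.gradient(u, dx, axis=0)           # first spatial derivative
u_xx = np.gradient(u_x, dx, axis=0)        # second spatial derivative

# Flatten every candidate term into one column per feature
features = np.column_stack([a.ravel() for a in (u, u_x, u_xx, u * u_x)])
```

For this pulse $u(x, t) = f(x - t)$, so $u_t \approx -u_x$ holds on the grid, a handy sanity check for the derivative code.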

Building the PDE Feature Matrix

At each grid point $(x_i, t_j)$, compute all candidate terms:

$u_t^{(i,j)} = \theta_1 \cdot u^{(i,j)} + \theta_2 \cdot u_x^{(i,j)} + \theta_3 \cdot u_{xx}^{(i,j)} + \theta_4 \cdot u^{(i,j)} u_x^{(i,j)} + \cdots$

Stack all interior grid points into one system. After sparsification:

Discover: $u_t = \alpha \, u_{xx}$ (the diffusion equation). Only $u_{xx}$ survives.

From Grid to System of Equations

At each interior grid point $(x_i, t_j)$, approximate the derivatives with finite differences:

$u_t^{(i,j)} \approx \frac{u^{(i,j+1)} - u^{(i,j)}}{\Delta t}, \quad u_x^{(i,j)} \approx \frac{u^{(i+1,j)} - u^{(i-1,j)}}{2\Delta x}, \quad u_{xx}^{(i,j)} \approx \frac{u^{(i+1,j)} - 2u^{(i,j)} + u^{(i-1,j)}}{\Delta x^2}$

Stacking into a Linear System

Each interior grid point gives one equation. Stack them all:

$\underbrace{\begin{bmatrix} u_t^{(2,1)} \\ u_t^{(3,1)} \\ u_t^{(2,2)} \\ \vdots \\ u_t^{(i,j)} \\ \vdots \end{bmatrix}}_{\dot{\mathbf{U}}} = \underbrace{\begin{bmatrix} u_x^{(2,1)} & u_{xx}^{(2,1)} \\ u_x^{(3,1)} & u_{xx}^{(3,1)} \\ u_x^{(2,2)} & u_{xx}^{(2,2)} \\ \vdots & \vdots \\ u_x^{(i,j)} & u_{xx}^{(i,j)} \\ \vdots & \vdots \end{bmatrix}}_{\boldsymbol{\Phi}} \underbrace{\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}}_{\boldsymbol{\theta}}$
Same structure as the ODE case: $\dot{\mathbf{U}} = \boldsymbol{\Phi}\,\boldsymbol{\theta}$. Solve with least squares, then sparsify. The surviving terms reveal the PDE.
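As a sanity check, one can simulate the diffusion equation and recover $\alpha$ with exactly this stacked least-squares system. A sketch (grid sizes, $\alpha$, and the initial condition are illustrative):

```python
import numpy as np

# Simulate u_t = alpha * u_xx, then recover alpha from the data alone.
alpha_true = 0.5
dx, dt = 0.1, 0.005                       # dt < dx^2 / (2 alpha): stable
x = np.arange(0, 10 + dx, dx)
u = np.exp(-((x - 5) ** 2))               # initial condition: Gaussian bump

snapshots = [u.copy()]
for _ in range(200):                      # explicit finite-difference solver
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    u = u + dt * alpha_true * lap
    snapshots.append(u.copy())
U = np.array(snapshots).T                 # U[i, j] = u(x_i, t_j)

# Derivative features, keeping interior spatial rows only
u_t = np.gradient(U, dt, axis=1)[1:-1, :].ravel()
u_x = np.gradient(U, dx, axis=0)[1:-1, :].ravel()
u_xx = np.gradient(np.gradient(U, dx, axis=0), dx, axis=0)[1:-1, :].ravel()

Phi = np.column_stack([u_x, u_xx])        # two-term library, as on the slide
theta = np.linalg.lstsq(Phi, u_t, rcond=None)[0]
# theta[1] should land close to alpha_true, theta[0] close to 0
```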

PDEs You Can Discover from Data

With a richer library $[u, u_x, u_{xx}, u \cdot u_x, u_{xxx}, \dots]$, the same approach discovers many classical PDEs:

| Equation | Surviving features | Physics |
|---|---|---|
| Diffusion: $u_t = \alpha u_{xx}$ | $u_{xx}$ | Heat conduction |
| Advection: $u_t = -c \, u_x$ | $u_x$ | Wave transport |
| Burgers: $u_t = \nu u_{xx} - u \, u_x$ | $u_{xx}$, $u \cdot u_x$ | Shock formation |
| KdV: $u_t = -u \, u_x + \delta \, u_{xxx}$ | $u \cdot u_x$, $u_{xxx}$ | Solitons |
Despite the name, the method is not limited to nonlinear dynamics: it is really sparse identification of differential equations from data.

Rudy, Brunton, Proctor & Kutz, Science Advances (2017) — PDE-FIND: the extension of SINDy to partial differential equations.

Practical Considerations

Challenges

  • Noisy derivatives degrade results
  • Threshold $\lambda$ is a hyperparameter
  • Library must contain the right terms
  • Doesn't scale to very high-dimensional systems

Strengths

  • Interpretable: output is an equation
  • Simple: linear regression + thresholding
  • Computationally cheap
  • Works with moderate data

SINDy in Context

| Approach | Hypothesis class | Interpretable? | Data needed |
|---|---|---|---|
| Neural network | Any continuous function | No | Lots |
| SINDy | Sparse combination of known terms | Yes | Moderate |
| Symbolic regression | Arbitrary expressions | Yes | Moderate |
| Physics-informed NN | NN constrained by known PDE | Partially | Less |
SINDy's appeal: very easy to understand. It's linear regression with sparsification. That's all it is.

Summary

1. Derivative: compute $\dot{x}$ from data; handle noise carefully.
2. Library: build $\boldsymbol{\Phi}(\mathbf{x})$ from candidate terms; encode your hypothesis.
3. Sparsify: solve and threshold iteratively; discover the equation.

$\dot{\mathbf{X}} = \boldsymbol{\Phi}(\mathbf{X}) \, \boldsymbol{\theta} \quad \longrightarrow \quad$ discover $\boldsymbol{\theta}$ (sparse)

Brunton et al., PNAS (2016)  |  Rudy et al., Science Advances (2017)

Looking Ahead

  • Next lecture: Symbolic regression — what if the library itself is unknown?
  • Project idea: Find real-world time series data, apply SINDy, discover governing equations
  • Homework: apply SINDy to the predator-prey data and discover the governing equations; extend to a PDE (diffusion) using finite differences.